A few comments from me inline:

> I think it is reasonable to make this a Spark job. The number of files in
> tables we convert typically requires it. This would only be too much for the
> driver if all of the files are collected at one time. We commit 500,000 files
> per batch, which seems to work well. That trades atomicity, though.

Makes sense to me, just wanted to double-check that the “proper” API should still be based on a Spark job.
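Just to make sure I follow the batching part, something like the sketch below is what I have in mind (only an illustration with made-up names, not the code you actually run; dataFiles would come from whatever listing the Spark job does, e.g. via SparkTableUtil):

import org.apache.iceberg.{DataFile, Table}

// Sketch: append the collected data files in fixed-size batches.
// Each batch is its own snapshot/commit, which bounds driver memory
// but gives up atomicity for the migration as a whole.
def appendInBatches(table: Table, dataFiles: Seq[DataFile], batchSize: Int = 500000): Unit = {
  dataFiles.grouped(batchSize).foreach { batch =>
    val append = table.newAppend()
    batch.foreach(append.appendFile)
    append.commit()
  }
}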
> - What should be the scope? Do we want the migration to modify the catalog?
>
> What do you mean modify the catalog?
>
> We've built two SQL commands using these helper classes and methods: SNAPSHOT
> TABLE that creates a new table copy, and MIGRATE TABLE that basically builds
> a snapshot, but also renames the new table into place.

By modifying the catalog I mean creating/dropping tables in it. My question is whether this should be part of the migration tool/API or not. Right now, it isn’t. Instead, the catalog is modified by the SNAPSHOT TABLE and MIGRATE TABLE commands, which use SparkTableUtil. I think it is reasonable to keep it this way, meaning that the scope of the migration tool is to find files and append them to an existing Iceberg table.

> - If we stick to having Dataset[DataFiles], we will have one append per
> dataset partition. How do we want to guarantee that either all files are
> appended or none? One way is to rely on temporary tables but this requires
> looking into the catalog. Another approach is to say that if the migration
> isn’t successful, users will need to remove the metadata and retry.
>
> I don't recommend committing from tasks. If Spark runs speculative attempts,
> you can get duplicate data.

That’s exactly my point. In addition, if one task fails, we end up with partial results. However, it seems to be the only possible way right now. We can mitigate the effects with temporary tables, but that doesn’t solve the problem completely.

> Another option is to add a way to append manifest files, not just data files.
> That way, each task produces a manifest and we can create a single commit to
> add all of the manifests. This requires some rewriting, but I think it would
> be a good way to distribute the majority of the work and still have an atomic
> commit.

This sounds promising but requires quite a few changes to the public API. I am wondering if we can instead group all files per Dataset partition and leverage the ReplacePartitions/OverwriteFiles APIs in Iceberg. That way, the migration won’t be atomic, but it will be idempotent.
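To illustrate what I mean by the ReplacePartitions route, roughly something like the following (again just a sketch with made-up names; in the real job the grouping would be wired through the Dataset, and this version commits per table partition from the driver):

import org.apache.iceberg.{DataFile, Table}

// Sketch: group the discovered data files by table partition and issue one
// ReplacePartitions commit per group. If the migration dies part-way,
// re-running it replaces the same partitions with the same files, so the end
// state converges (idempotent), even though there is no single atomic commit.
def commitByPartition(table: Table, dataFiles: Seq[DataFile]): Unit = {
  dataFiles.groupBy(_.partition().toString).foreach { case (_, filesInPartition) =>
    val replace = table.newReplacePartitions()
    filesInPartition.foreach(replace.addFile)
    replace.commit()
  }
}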
Thanks,
Anton

> On 15 May 2019, at 19:12, Ryan Blue <rb...@netflix.com> wrote:
>
> Replies inline:
>
> On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> I would like to resume this topic. How do we see the proper API for migration?
>
> I have a couple of questions in mind:
> - Now, it is based on a Spark job. Do we want to keep it that way because the
> number of files might be huge? Will it be too much for the driver?
>
> I think it is reasonable to make this a Spark job. The number of files in
> tables we convert typically requires it. This would only be too much for the
> driver if all of the files are collected at one time. We commit 500,000 files
> per batch, which seems to work well. That trades atomicity, though.
>
> - What should be the scope? Do we want the migration to modify the catalog?
>
> What do you mean modify the catalog?
>
> We've built two SQL commands using these helper classes and methods: SNAPSHOT
> TABLE that creates a new table copy, and MIGRATE TABLE that basically builds
> a snapshot, but also renames the new table into place.
>
> - If we stick to having Dataset[DataFiles], we will have one append per
> dataset partition. How do we want to guarantee that either all files are
> appended or none? One way is to rely on temporary tables but this requires
> looking into the catalog. Another approach is to say that if the migration
> isn’t successful, users will need to remove the metadata and retry.
>
> I don't recommend committing from tasks. If Spark runs speculative attempts,
> you can get duplicate data.
>
> Another option is to add a way to append manifest files, not just data files.
> That way, each task produces a manifest and we can create a single commit to
> add all of the manifests. This requires some rewriting, but I think it would
> be a good way to distribute the majority of the work and still have an atomic
> commit.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
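P.S. For the manifest-based option above, here is roughly the commit side I would picture (a sketch only: appendManifest is the API being proposed, it is not in AppendFiles today, and the ManifestFile entries would have to be produced by the tasks and collected to the driver):

import org.apache.iceberg.{ManifestFile, Table}

// Sketch: each task writes one manifest for its data files and returns the
// resulting ManifestFile; the driver then adds all manifests in a single
// append, so the whole migration becomes one atomic commit.
def commitManifests(table: Table, manifests: Seq[ManifestFile]): Unit = {
  val append = table.newAppend()
  manifests.foreach(append.appendManifest) // proposed API, does not exist yet
  append.commit()
}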