Thanks for pointing out those references, suds! And thanks to Mouli (for writing the doc) and Anton (for writing the test)!
On Tue, Jan 28, 2020 at 2:05 PM suds <sudssf2...@gmail.com> wrote:

> We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
> implemented the atomic operation using DynamoDB optimistic locking. The
> Iceberg codebase has an excellent test case for validating a custom
> implementation:
>
> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
>
> On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Hi Gautam,
>>
>> Hadoop tables are not intended to be used when the file system doesn't
>> support atomic rename, because of the problems you describe. Atomic rename
>> is a requirement for correctness in Hadoop tables.
>>
>> That is why we also have metastore tables, where some other atomic swap
>> is used. I strongly recommend using a metastore-based solution, like the
>> Hive catalog, when your underlying file system doesn't support atomic
>> rename. We've also made it easy to plug in your own metastore using the
>> `BaseMetastore` classes.
>>
>> That said, if you have an idea to make Hadoop tables better, I'm all for
>> getting it in. But version hint file aside, without atomic rename, two
>> committers could still conflict and cause one of the commits to be dropped,
>> because the second one to create any particular version's metadata file may
>> succeed. I don't see a way around this.
>>
>> If you don't want to use a metastore, then you could rely on a write lock
>> provided by ZooKeeper or something similar.
>>
>> On Tue, Jan 28, 2020 at 12:22 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Hello Devs,
>>> We are currently working on building out a high-write-throughput
>>> pipeline with Iceberg, where hundreds or thousands of writers (and
>>> thousands of readers) could be accessing a table at any given moment. We
>>> are facing the issue called out in [1]. According to Iceberg's spec on
>>> write reliability [2], writers depend on an atomic swap, which should be
>>> retried if it fails. While this may be true, there can be instances where
>>> the current writer has the latest table state but still fails to perform
>>> the swap, or, even worse, a reader sees an inconsistency while the write
>>> is being made. To my understanding, this stems from the fact that the
>>> current code [3] that does the swap assumes the underlying filesystem
>>> provides an atomic rename API (like HDFS et al.) for the version hint
>>> file, which keeps track of the current version. If the filesystem does
>>> not provide this, then it fails with a fatal error. I think Iceberg
>>> should provide some resiliency here by committing the version once it
>>> knows that the latest table state is still valid and, more importantly,
>>> by ensuring that readers never fail during a commit. If we agree, I can
>>> work on adding this to Iceberg.
>>>
>>> How are folks handling write/read consistency in cases where the
>>> underlying fs doesn't provide atomic APIs for file overwrite/rename?
>>> We've outlined the details in the attached issue #758 [1]. What do folks
>>> think?
>>>
>>> Cheers,
>>> -Gautam.
>>>
>>> [1] - https://github.com/apache/incubator-iceberg/issues/758
>>> [2] - https://iceberg.incubator.apache.org/reliability/
>>> [3] - https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
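[Editor's note] The commit pattern discussed in this thread, an optimistic read-check-swap loop that retries against the latest table state on conflict, can be sketched as follows. This is not Iceberg's actual API: the class and method names are hypothetical, and an in-memory `AtomicReference` stands in for the catalog's pointer to the current metadata file so the sketch is runnable. A real catalog would back the swap with a DynamoDB conditional write, a Hive metastore lock, or a ZooKeeper lock, as the replies above suggest.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an optimistic-locking commit loop; not Iceberg code.
public class OptimisticCommitSketch {
    // Stand-in for the catalog's pointer to the current metadata location.
    // In a real catalog this would be a row in DynamoDB, a Hive metastore
    // table property, or a ZooKeeper znode.
    private final AtomicReference<String> currentMetadataLocation;

    public OptimisticCommitSketch(String initialLocation) {
        this.currentMetadataLocation = new AtomicReference<>(initialLocation);
    }

    /**
     * Read the current base version, build new metadata from it, then
     * atomically swap the pointer only if the base is still current.
     * On conflict (another writer committed first), re-read and retry.
     */
    public String commit(UnaryOperator<String> buildNewMetadata, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String base = currentMetadataLocation.get();       // read latest state
            String newLocation = buildNewMetadata.apply(base); // write new metadata
            // Atomic swap: succeeds only if no other writer moved the pointer.
            if (currentMetadataLocation.compareAndSet(base, newLocation)) {
                return newLocation;
            }
            // Lost the race: loop re-reads the new base and rebuilds the commit.
        }
        throw new IllegalStateException("Commit failed after " + maxAttempts + " attempts");
    }

    public String current() {
        return currentMetadataLocation.get();
    }
}
```

With DynamoDB, the `compareAndSet` step would map to a conditional write that only succeeds when the stored metadata location still equals the base the writer read, which is the optimistic-locking approach suds describes.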