Thanks for pointing out those references, suds! And thanks to Mouli (for writing the doc) and Anton (for writing the test)!
On Tue, Jan 28, 2020 at 2:05 PM suds <sudssf2...@gmail.com> wrote:

> We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
> implemented the atomic operation using DynamoDB optimistic locking. The
> Iceberg codebase has an excellent test case for validating a custom
> implementation:
>
> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
>
> On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Hi Gautam,
>>
>> Hadoop tables are not intended to be used when the file system doesn't
>> support atomic rename, because of the problems you describe. Atomic rename
>> is a requirement for correctness in Hadoop tables.
>>
>> That is why we also have metastore tables, where some other atomic swap
>> is used. I strongly recommend using a metastore-based solution, like the
>> Hive catalog, when your underlying file system doesn't support atomic
>> rename. We've also made it easy to plug in your own metastore using the
>> `BaseMetastore` classes.
>>
>> That said, if you have an idea to make Hadoop tables better, I'm all for
>> getting it in. But version hint file aside, without atomic rename, two
>> committers could still conflict and cause one of the commits to be dropped,
>> because the second one to create any particular version's metadata file may
>> succeed. I don't see a way around this.
>>
>> If you don't want to use a metastore, then you could rely on a write lock
>> provided by ZooKeeper or something similar.
>>
>> On Tue, Jan 28, 2020 at 12:22 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Hello Devs,
>>> We are currently working on building out a high-write-throughput
>>> pipeline with Iceberg, where hundreds or thousands of writers (and
>>> thousands of readers) could be accessing a table at any given moment. We
>>> are facing the issue called out in [1]. According to Iceberg's spec on
>>> write reliability [2], writers depend on an atomic swap, which should be
>>> retried if it fails. While this may be true, there can be instances where
>>> the current writer has the latest table state but still fails to perform
>>> the swap, or, even worse, a reader sees an inconsistency while the write
>>> is being made. To my understanding, this stems from the fact that the
>>> current code [3] that does the swap assumes the underlying filesystem
>>> provides an atomic rename API (like HDFS et al.) for the version hint
>>> file, which keeps track of the current version. If the filesystem does
>>> not provide this, then it fails with a fatal error. I think Iceberg
>>> should provide some resiliency here by committing the version once it
>>> knows that the latest table state is still valid and, more importantly,
>>> by ensuring that readers never fail during a commit. If we agree, I can
>>> work on adding this to Iceberg.
>>>
>>> How are folks handling write/read consistency in cases where the
>>> underlying fs doesn't provide atomic APIs for file overwrite/rename?
>>> We've outlined the details in the attached issue #758 [1]. What do folks
>>> think?
>>>
>>> Cheers,
>>> -Gautam.
>>>
>>> [1] - https://github.com/apache/incubator-iceberg/issues/758
>>> [2] - https://iceberg.incubator.apache.org/reliability/
>>> [3] - https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
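[Editor's note] The commit pattern discussed in this thread, an optimistic read-check-swap loop that retries against the latest table state on conflict, can be sketched as follows. This is not Iceberg's actual API: the class and method names are hypothetical, and an in-memory `AtomicReference` stands in for the catalog's pointer to the current metadata file so the sketch is runnable. A real catalog would back the swap with a DynamoDB conditional write, a Hive metastore lock, or a ZooKeeper lock, as the replies above suggest.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an optimistic-locking commit loop; not Iceberg code.
public class OptimisticCommitSketch {
    // Stand-in for the catalog's pointer to the current metadata location.
    // In a real catalog this would be a row in DynamoDB, a Hive metastore
    // table property, or a ZooKeeper znode.
    private final AtomicReference<String> currentMetadataLocation;

    public OptimisticCommitSketch(String initialLocation) {
        this.currentMetadataLocation = new AtomicReference<>(initialLocation);
    }

    /**
     * Read the current base version, build new metadata from it, then
     * atomically swap the pointer only if the base is still current.
     * On conflict (another writer committed first), re-read and retry.
     */
    public String commit(UnaryOperator<String> buildNewMetadata, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String base = currentMetadataLocation.get();       // read latest state
            String newLocation = buildNewMetadata.apply(base); // write new metadata
            // Atomic swap: succeeds only if no other writer moved the pointer.
            if (currentMetadataLocation.compareAndSet(base, newLocation)) {
                return newLocation;
            }
            // Lost the race: loop re-reads the new base and rebuilds the commit.
        }
        throw new IllegalStateException("Commit failed after " + maxAttempts + " attempts");
    }

    public String current() {
        return currentMetadataLocation.get();
    }
}
```

With DynamoDB, the `compareAndSet` step would map to a conditional write that only succeeds when the stored metadata location still equals the base the writer read, which is the optimistic-locking approach suds describes.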