Hi Gautam,

Hadoop tables are not intended to be used when the file system doesn't
support atomic rename because of the problems you describe. Atomic rename
is a requirement for correctness in Hadoop tables.

That is why we also have metastore tables, where some other atomic swap is
used. When your underlying file system doesn't support atomic rename, I
strongly recommend a metastore-based solution, like the Hive catalog. We've
also made it easy to plug in your own metastore using the
`BaseMetastore` classes.
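
For example, something like this (a rough sketch, not tested; the
KeyValueStore interface and its compareAndSwap call are stand-ins for
whatever store you use, and the method names follow the current
BaseMetastoreTableOperations, so adjust to the version you're on):

import org.apache.iceberg.BaseMetastoreTableOperations;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.iceberg.io.FileIO;

public class KeyValueTableOperations extends BaseMetastoreTableOperations {
  // Hypothetical store client: the only hard requirement is an atomic CAS.
  public interface KeyValueStore {
    String get(String key);
    boolean compareAndSwap(String key, String expected, String newValue);
  }

  private final KeyValueStore store;
  private final String tableKey;
  private final FileIO io;

  public KeyValueTableOperations(KeyValueStore store, String tableKey, FileIO io) {
    this.store = store;
    this.tableKey = tableKey;
    this.io = io;
  }

  @Override
  public FileIO io() {
    return io;
  }

  @Override
  protected void doRefresh() {
    // Load whatever metadata file the store currently points to.
    refreshFromMetadataLocation(store.get(tableKey));
  }

  @Override
  protected void doCommit(TableMetadata base, TableMetadata metadata) {
    // Write the new metadata file first; it is harmless if it is never used.
    String newLocation = writeNewMetadata(metadata, currentVersion() + 1);
    String oldLocation = base == null ? null : base.metadataFileLocation();

    // The pointer swap is the only step that must be atomic.
    if (!store.compareAndSwap(tableKey, oldLocation, newLocation)) {
      // Tells Iceberg to refresh, reapply the pending changes, and retry.
      throw new CommitFailedException("Concurrent update to %s", tableKey);
    }
  }
}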

That said, if you have an idea to make Hadoop tables better, I'm all for
getting it in. But the version hint file aside, without atomic rename two
committers can still conflict and cause one of the commits to be dropped:
the second writer to create a particular version's metadata file may also
succeed, silently overwriting the first. I don't see a way around this.
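
To make the race concrete, here is a toy sketch in plain java.nio (nothing
Iceberg-specific, and the paths are made up): an atomic create-if-absent
means exactly one writer can win a given version, and that is precisely the
primitive most object stores don't offer.

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class VersionFileRace {
  static boolean tryCommitVersion(Path metadataFile) {
    try {
      Files.createFile(metadataFile); // atomic: only one caller can create it
      return true;
    } catch (FileAlreadyExistsException e) {
      return false; // another writer already committed this version
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path v3 = Paths.get("/tmp/table/metadata/v3.metadata.json");
    Files.createDirectories(v3.getParent());
    System.out.println("writer A committed: " + tryCommitVersion(v3)); // true
    System.out.println("writer B committed: " + tryCommitVersion(v3)); // false
  }
}

Without an atomic create or rename, both calls appear to succeed and the
second silently clobbers the first.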

If you don't want to use a metastore, then you could rely on a write lock
provided by ZooKeeper or something similar.
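
For example, with Apache Curator (the lock path, timeout, and connect
string are illustrative; this only works if every writer cooperates by
taking the same lock before committing):

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

public class ZkLockedCommit {
  public static void appendWithLock(Table table, DataFile file, String zkConnect)
      throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
    try {
      InterProcessMutex lock =
          new InterProcessMutex(client, "/iceberg/locks/db.table");
      if (!lock.acquire(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("could not acquire commit lock");
      }
      try {
        table.refresh(); // pick up the latest state while holding the lock
        table.newAppend().appendFile(file).commit();
      } finally {
        lock.release();
      }
    } finally {
      client.close();
    }
  }
}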

On Tue, Jan 28, 2020 at 12:22 PM Gautam <gautamkows...@gmail.com> wrote:

> Hello Devs,
>                      We are currently working on building out a high write
> throughput pipeline with Iceberg where hundreds or thousands of writers
> (and thousands of readers) could be accessing a table at any given moment.
> We are facing the issue called out in [1]. According to Iceberg's spec on
> write reliability [2], writers depend on an atomic swap which, if it
> fails, should be retried. While this may be true, there can be instances
> where the current writer has the latest table state but still fails to
> perform the swap, or, worse, a reader sees an inconsistent state while
> the write is in progress. To my understanding, this stems from the fact
> that the current code [3] that does the swap assumes the underlying
> filesystem provides an atomic rename API (like HDFS does) for the version
> hint file, which keeps track of the current version. If the filesystem
> does not provide this, the commit fails with a fatal error. I think
> Iceberg should provide some resiliency here by committing the version
> once it knows the latest table state is still valid and, more
> importantly, by ensuring readers never fail during a commit. If we agree,
> I can work on adding this to Iceberg.
>
> How are folks handling write/read consistency in cases where the
> underlying fs doesn't provide atomic APIs for file overwrite/rename?
> We've outlined the details in issue #758 [1]. What do folks think?
>
> Cheers,
> -Gautam.
>
> [1] - https://github.com/apache/incubator-iceberg/issues/758
> [2] - https://iceberg.incubator.apache.org/reliability/
> [3] -
> https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220
>


-- 
Ryan Blue
Software Engineer
Netflix
