Okay, thanks for explaining. I understand now.

The Hadoop table implementation is the only place where rename is used, and
it requires a file system that supports atomic rename. If you're using an
object store like S3 or GCS, then you should be using the HMS
implementation or a custom catalog instead of Hadoop tables.

The difference between these is how Iceberg keeps track of the current root
metadata file. HMS tables store the metadata location as a table property
of a table in the Hive MetaStore, and use the table locking API to
coordinate updates. If you're using the Hive MetaStore, then this should
work out of the box.

If you are using an alternative metastore, then you just need to implement
a custom catalog that handles the atomic swap from one metadata location to
another. Mouli just added a guide for doing this here (thanks!):
http://iceberg.apache.org/custom-catalog/

That's where you'd plug in your preferred method for making an atomic
update. That could be locking with ZooKeeper, using a database transaction,
or some other method. You just need to provide a way to atomically swap
metadata file location strings, and a way to get the current location.
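
To make that concrete, here's a rough sketch of the shape a custom
TableOperations can take. It assumes the doRefresh()/doCommit() extension
points on BaseMetastoreTableOperations that the guide describes (exact names
may differ a bit depending on your Iceberg version), and MetadataStore is a
hypothetical stand-in for whatever you back the pointer with (a database row,
a ZooKeeper node, etc.):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.BaseMetastoreTableOperations;
  import org.apache.iceberg.TableMetadata;
  import org.apache.iceberg.exceptions.CommitFailedException;
  import org.apache.iceberg.hadoop.HadoopFileIO;
  import org.apache.iceberg.io.FileIO;

  // Hypothetical pointer store: could be a database table, a ZooKeeper node,
  // etc. All it needs is a read of the current value and an atomic
  // compare-and-swap.
  interface MetadataStore {
    String getLocation(String tableId);
    boolean compareAndSwap(String tableId, String expectedLocation, String newLocation);
  }

  class CustomTableOperations extends BaseMetastoreTableOperations {
    private final Configuration conf;
    private final MetadataStore store;
    private final String tableId;

    CustomTableOperations(Configuration conf, MetadataStore store, String tableId) {
      this.conf = conf;
      this.store = store;
      this.tableId = tableId;
    }

    @Override
    protected String tableName() {
      return tableId;
    }

    @Override
    public FileIO io() {
      // anything that can read and write your metadata and data files
      return new HadoopFileIO(conf);
    }

    @Override
    protected void doRefresh() {
      // "a way to get the current location"
      refreshFromMetadataLocation(store.getLocation(tableId));
    }

    @Override
    protected void doCommit(TableMetadata base, TableMetadata metadata) {
      // writing the new metadata file does not need to be atomic
      String newLocation = writeNewMetadata(metadata, currentVersion() + 1);

      // the only atomic step: swap the pointer, but only if it still points
      // at the metadata this commit was based on; for a brand-new table the
      // expected location is null and the store should create the entry
      if (!store.compareAndSwap(tableId, currentMetadataLocation(), newLocation)) {
        throw new CommitFailedException("Metadata location for %s changed concurrently", tableId);
      }
    }
  }

Once those two pieces work, the rest (writing manifests and snapshots,
retrying on commit conflicts) is handled by the library.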

I hope that helps! In the end this route should be easier than relying on
rename, since the API for plugging in a custom catalog already exists.

rb

On Fri, Sep 13, 2019 at 11:02 AM Dave Sugden
<[email protected]> wrote:

>
> On Fri, Sep 13, 2019 at 1:47 PM Ryan Blue <[email protected]>
> wrote:
>
>> Hi Dave,
>>
>> I'm sure we can get this working, but I'd like to understand what you're
>> trying to do a bit better.
>>
>> Why do you need atomic rename? Iceberg is set up to write data in place
>> and not move or rename files. Committing those files to a table is an
>> atomic operation instead. Everything should work with GCS without
>> modification as far as I know, unless you don't want to use the Hadoop
>> FileSystem APIs.
>>
>>
> There is no native atomic rename in GCS; it requires a copy + delete. From
> the page https://iceberg.apache.org/spec/#mvcc-and-optimistic-concurrency
>  :
>
> "Tables do not require rename, except for tables that use atomic rename
> to implement the commit operation for new metadata files."
>
> This ^ is what we are addressing, i.e. when a snapshot commit occurs and
> the temp metadata file is renamed to the final metadata file.
>
> From HadoopTableOperations.java L#248:
>
>   /**
>    * Renames the source file to destination, using the provided file
>    * system. If the rename failed, an attempt will be made to delete
>    * the source file.
>    *
>    * @param fs the filesystem used for the rename
>    * @param src the source file
>    * @param dst the destination file
>    */
>   private void renameToFinal(FileSystem fs, Path src, Path dst) {
>     try {
>       if (!fs.rename(src, dst)) {
>
>
> The above is called in commit, and AFAIK comes with the assumption that
> FileSystem.rename() is atomic...?
>
>
>
>> Keeping file appenders open using a write property or a table property
>> sounds like a good idea to me. I wouldn't want this to be the default for
>> batch writes, but I think it may make sense as an option for streaming
>> writes. I'd prefer to add these features to the existing streaming writer
>> instead of allowing users to use their own custom writer. Are there other
>> reasons to replace the writer instead of making this behavior configurable?
>>
>
> Nope, that was the only reason. That's fine then, as long as this can be
> supported for streaming writes.
>
>>

-- 
Ryan Blue
Software Engineer
Netflix
