Hi Iceberg folks,

In the last few months, we (the data infrastructure team at Airbnb) have been closely following the project. We are currently evaluating potential strategies to migrate our data warehouse to Iceberg. However, we have a very large Hive deployment, which means we can't really do so without support for Iceberg tables in Hive.
We have been thinking about implementation strategies. Here are some thoughts that we would like to share with you:

– Implementing a new `RawStore`

This is something that has been mentioned several times on the mailing list, and it seems to indicate that adding support for Iceberg tables in Hive could be achieved without client-side modifications. Does that mean that the Metastore is the only process manipulating Iceberg metadata (snapshots, manifests)? Does that mean that, for instance, the `listPartition*` calls to the Metastore return the DataFiles associated with each partition? Per our understanding, supporting Iceberg tables in Hive with this strategy will most likely require updating the `RawStore` interface AND at least some client-side changes. In addition, with this strategy the Metastore takes on new responsibilities, which contradicts one of the Iceberg design goals: offloading more work to jobs and removing the Metastore as a bottleneck. In the Iceberg world, not much is needed from the Metastore: it just keeps track of the metadata location and provides a mechanism for atomically updating this location (basically, what is done in the `HiveTableOperations` class). We would like to design a solution that relies as little as possible on the Metastore so that, in the future, we have the option to replace our fleet of Metastores with a simpler system.

– Implementing a new `HiveStorageHandler`

We are working on implementing custom `InputFormat` and `OutputFormat` classes for Iceberg (more on that in the next paragraph), and they would fit in nicely with the `HiveStorageHandler` and `HiveStoragePredicateHandler` interfaces. However, the `HiveMetaHook` interface does not seem rich enough to accommodate all the workflows; for instance, no hooks run on `ALTER ...` or `INSERT ...` commands.

– Proof of concept

We set out to write a proof of concept that would allow us to learn and experiment. We based our work on the 2.3 branch. Here's the state of the project and the paths we explored (a few illustrative sketches are appended at the end of this email):

DDL commands

We support some commands such as `CREATE TABLE ...`, `DESC ...`, and `SHOW PARTITIONS`. They are all implemented in the client and mostly rely on the `HiveCatalog` class to do the work.

Read path

We are in the process of implementing a custom `FileInputFormat` that receives an Iceberg table identifier and a serialized `ExprNodeDesc` expression as input. This is similar in a lot of ways to what you can find in the `PigParquetReader` class in the `iceberg-pig` package or in the `HiveHBaseTableInputFormat` class in Hive.

Write path

We have made less progress on that front, but we see a path forward by implementing a custom `OutputFormat` that would keep track of the files that are being written and gather statistics. Then, each task can dump this information to HDFS. From there, the final Hive `MoveTask` can merge those "pre-manifest" files to create a new snapshot and commit the new version of the table.

We hope that our observations will start a healthy conversation about supporting Iceberg tables in Hive :)

Cheers,
Adrien
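
Appended sketches (rough, not the actual PoC code):

1) DDL commands. A minimal sketch of how a `CREATE TABLE` statement can be translated into a catalog call. The table identifier, schema, and partition spec are placeholders, and the exact `HiveCatalog` construction may differ between Iceberg versions; in the PoC these values are derived from the parsed DDL rather than hard-coded.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.types.Types;

public class CreateTableSketch {
  public static void main(String[] args) {
    // The catalog only uses the Metastore to store and atomically update
    // the table's metadata location.
    Configuration conf = new Configuration();
    HiveCatalog catalog = new HiveCatalog(conf);

    // Placeholder schema and partition spec; in the PoC these come from the
    // parsed CREATE TABLE statement.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "ds", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();

    Table table = catalog.createTable(TableIdentifier.of("db", "events"), schema, spec);
    System.out.println("Created table at " + table.location());
  }
}
```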
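2) Read path. A sketch of the split-planning half of the custom InputFormat: load the table, plan the scan, and emit one split per file scan task. The job property name is made up for illustration, converting the serialized `ExprNodeDesc` into an Iceberg `Expression` is elided, and a real implementation would carry the full `FileScanTask` (e.g. the residual filter) in a custom split rather than a plain `FileSplit`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.io.CloseableIterable;

/** Sketch of split planning for an Iceberg InputFormat targeting Hive's mapred API. */
public class IcebergSplitPlanner {

  // "iceberg.mr.table.identifier" is a made-up property name for this sketch.
  static InputSplit[] planSplits(JobConf job) throws IOException {
    Table table = new HiveCatalog(job)
        .loadTable(TableIdentifier.parse(job.get("iceberg.mr.table.identifier")));

    // In the PoC the filter comes from the ExprNodeDesc pushed down by Hive;
    // the conversion to an Iceberg Expression is elided here.
    Expression filter = Expressions.alwaysTrue();

    List<InputSplit> splits = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().filter(filter).planFiles()) {
      for (FileScanTask task : tasks) {
        splits.add(new FileSplit(
            new Path(task.file().path().toString()), task.start(), task.length(), new String[0]));
      }
    }
    return splits.toArray(new InputSplit[0]);
  }
}
```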
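3) Write path. A sketch of the commit step the final `MoveTask` could perform: turn the per-task "pre-manifest" entries into `DataFile`s and append them in a single atomic commit. Reading the listings back from HDFS is elided, and the `WrittenFile` class and table name are placeholders.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

/** Sketch of the snapshot commit performed at the end of a write. */
public class CommitSketch {

  /** Minimal stand-in for one entry of a per-task "pre-manifest" file. */
  static class WrittenFile {
    String path;          // location of the data file written by the task
    String partitionPath; // e.g. "ds=2019-08-01"
    long sizeInBytes;
    long recordCount;
  }

  static void commit(Configuration conf, List<WrittenFile> writtenFiles) {
    Table table = new HiveCatalog(conf).loadTable(TableIdentifier.of("db", "events"));

    AppendFiles append = table.newAppend();
    for (WrittenFile f : writtenFiles) {
      DataFile dataFile = DataFiles.builder(table.spec())
          .withPath(f.path)
          .withFormat(FileFormat.PARQUET)
          .withPartitionPath(f.partitionPath)
          .withFileSizeInBytes(f.sizeInBytes)
          .withRecordCount(f.recordCount)
          .build();
      append.appendFile(dataFile);
    }

    // Produces a new snapshot and atomically swaps the table's metadata location
    // (via HiveTableOperations under the hood).
    append.commit();
  }
}
```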