Hi everyone,

I mentioned in the sync-up this morning that I’d send out an introduction
to the table format we’re working on, which we’re calling Iceberg.

For anyone that wasn’t around here’s the background: there are several
problems with how we currently manage data files to make up a table in the
Hadoop ecosystem. The one that came up today was that you can’t actually
update a table atomically to, for example, rewrite a file and safely delete
records. That’s because Hive tables track what files are currently visible
by listing partition directories, and we don’t have (or want) transactions
for changes in Hadoop file systems. This means that you can’t actually have
isolated commits to a table and the result is that *query results from Hive
tables can be wrong*, though rarely in practice.

The problems with current tables are caused primarily by keeping state
about what files are in or not in a table in the file system. As I said,
one problem is that there are no transactions but you also have to list
directories to plan jobs (bad on S3) and rename files from a temporary
location to a final location (really, really bad on S3).

To avoid these problems we’ve been building the Iceberg format that tracks
tracks every file in a table instead of tracking directories. Iceberg
maintains snapshots of all the files in a dataset and atomically swaps
snapshots and other metadata to commit. There are a few benefits to doing
it this way:

   - *Snapshot isolation*: Readers always use a consistent snapshot of the
   table, without needing to hold a lock. All updates are atomic.
   - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to
   plan a job, reading a snapshot requires O(1) RPC calls
   - *Distributed planning*: File pruning and predicate push-down is
   distributed to jobs, removing the metastore bottleneck
   - *Version history and rollback*: Table snapshots are kept around and it
   is possible to roll back if a job has a bug and commits
   - *Finer granularity partitioning*: Distributed planning and O(1) RPC
   calls remove the current barriers to finer-grained partitioning

We’re also taking this opportunity to fix a few other problems:

   - Schema evolution: columns are tracked by ID to support add/drop/rename
   - Types: a core set of types, thoroughly tested to work consistently
   across all of the supported data formats
   - Metrics: cost-based optimization metrics are kept in the snapshots
   - Portable spec: tables should not be tied to Java and should have a
   simple and clear specification for other implementers

We have the core library to track files done, along with most of a
specification, and a Spark datasource (v2) that can read Iceberg tables.
I’ll be working on the write path next and we plan to build a Presto
implementation soon.

I think this should be useful to others and it would be great to
collaborate with anyone that is interested.

rb
​
-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to