Thanks for the specification. A couple of questions:
1) What ties this to Parquet and not to any underlying store?
2) If the above is not true, can we expose an interface to install any underlying file format?
3) If we are defining snapshots, can we allow MVCC on top of the snapshots?

To elaborate on 3), I would like to see a fully transactional file format present which allows us to be generic and performant. I would be interested in writing a specification for updates in this format. Can you please share the repository link and some internal documents to understand a bit more?

Regards,

Atri

On 7 Dec 2017 05:50, "Ryan Blue" <[email protected]> wrote:

Hi everyone,

I mentioned in the sync-up this morning that I'd send out an introduction to the table format we're working on, which we're calling Iceberg. For anyone that wasn't around, here's the background: there are several problems with how we currently manage the data files that make up a table in the Hadoop ecosystem. The one that came up today was that you can't actually update a table atomically to, for example, rewrite a file and safely delete records. That's because Hive tables track what files are currently visible by listing partition directories, and we don't have (or want) transactions for changes in Hadoop file systems. This means that you can't actually have isolated commits to a table, and the result is that *query results from Hive tables can be wrong*, though rarely in practice.

The problems with current tables are caused primarily by keeping state about what files are or are not in a table in the file system. As I said, one problem is that there are no transactions, but you also have to list directories to plan jobs (bad on S3) and rename files from a temporary location to a final location (really, really bad on S3). To avoid these problems, we've been building the Iceberg format, which tracks every file in a table instead of tracking directories.
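The idea of tracking every file in table metadata, rather than listing directories, can be sketched as a compare-and-swap on a single snapshot pointer. This is a hypothetical toy model to illustrate the commit semantics, not the real Iceberg API; the `Table` class and its methods are invented for this sketch:

```python
# Toy model: a table commit as an atomic swap of an immutable file list.
# Hypothetical names; not the actual Iceberg library.

class Table:
    def __init__(self):
        # The current snapshot: (snapshot id, immutable list of data files).
        self.current_snapshot = ("s0", ["part-00000.parquet"])

    def commit(self, expected_id, new_id, new_files):
        # Atomic swap: succeeds only if no other writer committed in between.
        if self.current_snapshot[0] != expected_id:
            return False  # concurrent writer won; caller must retry
        self.current_snapshot = (new_id, list(new_files))
        return True

table = Table()
base_id, base_files = table.current_snapshot
# A writer produces a whole new file list (e.g. after rewriting a file),
# then swaps it in atomically; readers never see a partial update.
ok = table.commit(base_id, "s1", base_files + ["part-00001.parquet"])
```

A stale writer whose `expected_id` no longer matches gets `False` back and must rebase and retry, which is how isolated commits work without file-system transactions.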
Iceberg maintains snapshots of all the files in a dataset and atomically swaps snapshots and other metadata to commit. There are a few benefits to doing it this way:

- *Snapshot isolation*: Readers always use a consistent snapshot of the table, without needing to hold a lock. All updates are atomic.
- *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to plan a job, reading a snapshot requires O(1) RPC calls.
- *Distributed planning*: File pruning and predicate push-down are distributed to jobs, removing the metastore bottleneck.
- *Version history and rollback*: Table snapshots are kept around, and it is possible to roll back if a job has a bug and commits bad data.
- *Finer-granularity partitioning*: Distributed planning and O(1) RPC calls remove the current barriers to finer-grained partitioning.

We're also taking this opportunity to fix a few other problems:

- Schema evolution: columns are tracked by ID to support add/drop/rename.
- Types: a core set of types, thoroughly tested to work consistently across all of the supported data formats.
- Metrics: cost-based optimization metrics are kept in the snapshots.
- Portable spec: tables should not be tied to Java and should have a simple and clear specification for other implementers.

We have the core library to track files done, along with most of a specification, and a Spark datasource (v2) that can read Iceberg tables. I'll be working on the write path next, and we plan to build a Presto implementation soon.

I think this should be useful to others and it would be great to collaborate with anyone that is interested.

rb

--
Ryan Blue
Software Engineer
Netflix
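The schema-evolution point above (columns tracked by ID to support add/drop/rename) can be sketched in a few lines. This is a hypothetical illustration of the principle, not Iceberg's actual schema classes; all names here are invented:

```python
# Toy model: columns keyed by a stable numeric ID instead of by name,
# so a rename is a metadata-only change and data files stay valid.
# Hypothetical names; not the actual Iceberg library.

class Schema:
    def __init__(self):
        self.next_id = 1
        self.columns = {}  # column id -> current name

    def add_column(self, name):
        # IDs are never reused, so dropping and re-adding a name
        # cannot resurrect old data under the new column.
        col_id = self.next_id
        self.next_id += 1
        self.columns[col_id] = name
        return col_id

    def rename_column(self, col_id, new_name):
        # Data files reference col_id, not the name, so no rewrite is needed.
        self.columns[col_id] = new_name

schema = Schema()
cid = schema.add_column("user_id")
schema.rename_column(cid, "account_id")
```

Because readers resolve columns through the ID mapping, a renamed column still matches the data written under its old name, which is what makes add/drop/rename safe.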
