Just made it public a minute ago. The repo is here: https://github.com/Netflix/iceberg
It's built with Gradle and requires Spark 2.3.0-SNAPSHOT (for Datasource V2) and Parquet 1.9.1-SNAPSHOT (for API additions and bug fixes).

An early version of the spec is available for comments here: https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit?usp=sharing

Feedback is definitely welcome.

rb

On Wed, Jan 3, 2018 at 6:28 PM, Julien Le Dem <[email protected]> wrote:

> Happy new year!
> I'm interested as well.
> Did you get to publish your code on github?
> Thanks
>
> On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <[email protected]> wrote:
>
>> I'm working on getting the code out to our open source github org, probably
>> early next week. I'll set up a mailing list for it as well.
>>
>> rb
>>
>> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Sounds super interesting. Would love to collaborate on this. Do you have a
>> > repo or mailing list where you are working on this?
>> >
>> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <[email protected]> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> I mentioned in the sync-up this morning that I’d send out an introduction
>> >> to the table format we’re working on, which we’re calling Iceberg.
>> >>
>> >> For anyone that wasn’t around, here’s the background: there are several
>> >> problems with how we currently manage data files to make up a table in the
>> >> Hadoop ecosystem. The one that came up today was that you can’t actually
>> >> update a table atomically to, for example, rewrite a file and safely delete
>> >> records. That’s because Hive tables track what files are currently visible
>> >> by listing partition directories, and we don’t have (or want) transactions
>> >> for changes in Hadoop file systems. This means that you can’t actually have
>> >> isolated commits to a table and the result is that *query results from Hive
>> >> tables can be wrong*, though rarely in practice.
>> >>
>> >> The problems with current tables are caused primarily by keeping state
>> >> about what files are in or not in a table in the file system. As I said,
>> >> one problem is that there are no transactions, but you also have to list
>> >> directories to plan jobs (bad on S3) and rename files from a temporary
>> >> location to a final location (really, really bad on S3).
>> >>
>> >> To avoid these problems we’ve been building the Iceberg format that tracks
>> >> every file in a table instead of tracking directories. Iceberg maintains
>> >> snapshots of all the files in a dataset and atomically swaps snapshots and
>> >> other metadata to commit. There are a few benefits to doing it this way:
>> >>
>> >> - *Snapshot isolation*: Readers always use a consistent snapshot of the
>> >>   table, without needing to hold a lock. All updates are atomic.
>> >> - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to
>> >>   plan a job, reading a snapshot requires O(1) RPC calls.
>> >> - *Distributed planning*: File pruning and predicate push-down are
>> >>   distributed to jobs, removing the metastore bottleneck.
>> >> - *Version history and rollback*: Table snapshots are kept around and it
>> >>   is possible to roll back if a job has a bug and commits.
>> >> - *Finer granularity partitioning*: Distributed planning and O(1) RPC
>> >>   calls remove the current barriers to finer-grained partitioning.
>> >>
>> >> We’re also taking this opportunity to fix a few other problems:
>> >>
>> >> - Schema evolution: columns are tracked by ID to support add/drop/rename
>> >> - Types: a core set of types, thoroughly tested to work consistently
>> >>   across all of the supported data formats
>> >> - Metrics: cost-based optimization metrics are kept in the snapshots
>> >> - Portable spec: tables should not be tied to Java and should have a
>> >>   simple and clear specification for other implementers
>> >>
>> >> We have the core library to track files done, along with most of a
>> >> specification, and a Spark datasource (v2) that can read Iceberg tables.
>> >> I’ll be working on the write path next and we plan to build a Presto
>> >> implementation soon.
>> >>
>> >> I think this should be useful to others and it would be great to
>> >> collaborate with anyone that is interested.
>> >>
>> >> rb
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>> >
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
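
For readers new to the idea, the snapshot-based commit described in the original message (track every file, swap snapshots atomically, keep old snapshots for rollback) can be illustrated with a small in-memory sketch. This is not Iceberg's API or its on-disk format: the `Snapshot` and `Table` classes, their methods, and the file names are hypothetical, and a `threading.Lock` stands in for whatever atomic swap primitive a real implementation would use.

```python
import threading
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of every data file in the table at one point in time."""
    snapshot_id: int
    files: FrozenSet[str]


class Table:
    """Hypothetical in-memory stand-in for a table's 'current snapshot' pointer."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # assumption: stands in for an atomic swap
        self._history: List[Snapshot] = [Snapshot(0, frozenset())]
        self._current = self._history[0]

    def current(self) -> Snapshot:
        # Readers take a reference to an immutable snapshot and hold no lock
        # while they use it, so later commits never change what they see.
        return self._current

    def commit(self, base: Snapshot, added: Iterable[str] = (),
               removed: Iterable[str] = ()) -> bool:
        # Build the new file set outside the critical section.
        new_files = (base.files - frozenset(removed)) | frozenset(added)
        with self._lock:
            if self._current is not base:
                return False  # another writer committed first; caller retries
            snap = Snapshot(self._history[-1].snapshot_id + 1, new_files)
            self._history.append(snap)
            self._current = snap  # the swap that makes the commit visible
            return True

    def rollback(self, snapshot_id: int) -> None:
        # Old snapshots are kept, so rolling back is another pointer swap.
        with self._lock:
            self._current = next(
                s for s in self._history if s.snapshot_id == snapshot_id)


table = Table()
empty = table.current()
table.commit(empty, added=["a.parquet", "b.parquet"])
reader_view = table.current()

# Rewrite b.parquet as c.parquet in a single atomic commit.
table.commit(reader_view, added=["c.parquet"], removed=["b.parquet"])

# A commit based on a stale snapshot is rejected instead of clobbering state.
stale_ok = table.commit(empty, added=["d.parquet"])
```

In this sketch `reader_view` still lists `a.parquet` and `b.parquet` after the rewrite commits, which is the snapshot-isolation property from the list of benefits, and `stale_ok` is `False` because the table moved on after `empty` was read.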
