I've also created a google group for discussion: [email protected]
You can join the group here: https://groups.google.com/forum/#!forum/iceberg-devel rb On Thu, Jan 4, 2018 at 11:03 AM, Ryan Blue <[email protected]> wrote: > Just made it public a minute ago. The repo is here: https://github.com/ > Netflix/iceberg > > It's built with gradle and requires a Spark 2.3.0-SNAPSHOT (for Datasource > V2) and Parquet 1.9.1-SNAPSHOT (for API additions and bug fixes). > > An early version of the spec is available for comments here: > https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_ > Q8Qf0ctMyGBKslOswA/edit?usp=sharing > > Feedback is definitely welcome. > > rb > > On Wed, Jan 3, 2018 at 6:28 PM, Julien Le Dem <[email protected]> > wrote: > >> Happy new year! >> I'm interested as well. >> Did you get to publish your code on github? >> Thanks >> >> On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <[email protected]> >> wrote: >> >>> I'm working on getting the code out to our open source github org, >>> probably >>> early next week. I'll set up a mailing list for it as well. >>> >>> rb >>> >>> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <[email protected]> >>> wrote: >>> >>> > Sounds super interesting. Would love to collaborate on this. Do you >>> have a >>> > repo or mailing list where you are working on this? >>> > >>> > >>> > >>> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <[email protected]> >>> > wrote: >>> > >>> >> Hi everyone, >>> >> >>> >> I mentioned in the sync-up this morning that I’d send out an >>> introduction >>> >> to the table format we’re working on, which we’re calling Iceberg. >>> >> >>> >> For anyone that wasn’t around here’s the background: there are several >>> >> problems with how we currently manage data files to make up a table >>> in the >>> >> Hadoop ecosystem. The one that came up today was that you can’t >>> actually >>> >> update a table atomically to, for example, rewrite a file and safely >>> >> delete >>> >> records. That’s because Hive tables track what files are currently >>> visible >>> >> by listing partition directories, and we don’t have (or want) >>> transactions >>> >> for changes in Hadoop file systems. This means that you can’t actually >>> >> have >>> >> isolated commits to a table and the result is that *query results from >>> >> Hive >>> >> tables can be wrong*, though rarely in practice. >>> >> >>> >> The problems with current tables are caused primarily by keeping state >>> >> about what files are in or not in a table in the file system. As I >>> said, >>> >> one problem is that there are no transactions but you also have to >>> list >>> >> directories to plan jobs (bad on S3) and rename files from a temporary >>> >> location to a final location (really, really bad on S3). >>> >> >>> >> To avoid these problems we’ve been building the Iceberg format that >>> tracks >>> >> tracks every file in a table instead of tracking directories. Iceberg >>> >> maintains snapshots of all the files in a dataset and atomically swaps >>> >> snapshots and other metadata to commit. There are a few benefits to >>> doing >>> >> it this way: >>> >> >>> >> - *Snapshot isolation*: Readers always use a consistent snapshot >>> of the >>> >> table, without needing to hold a lock. All updates are atomic. >>> >> - *O(1) RPCs to plan*: Instead of listing O(n) directories in a >>> table >>> >> to >>> >> plan a job, reading a snapshot requires O(1) RPC calls >>> >> - *Distributed planning*: File pruning and predicate push-down is >>> >> distributed to jobs, removing the metastore bottleneck >>> >> - *Version history and rollback*: Table snapshots are kept around >>> and >>> >> it >>> >> is possible to roll back if a job has a bug and commits >>> >> - *Finer granularity partitioning*: Distributed planning and O(1) >>> RPC >>> >> calls remove the current barriers to finer-grained partitioning >>> >> >>> >> We’re also taking this opportunity to fix a few other problems: >>> >> >>> >> - Schema evolution: columns are tracked by ID to support >>> >> add/drop/rename >>> >> - Types: a core set of types, thoroughly tested to work >>> consistently >>> >> across all of the supported data formats >>> >> - Metrics: cost-based optimization metrics are kept in the >>> snapshots >>> >> - Portable spec: tables should not be tied to Java and should have >>> a >>> >> simple and clear specification for other implementers >>> >> >>> >> We have the core library to track files done, along with most of a >>> >> specification, and a Spark datasource (v2) that can read Iceberg >>> tables. >>> >> I’ll be working on the write path next and we plan to build a Presto >>> >> implementation soon. >>> >> >>> >> I think this should be useful to others and it would be great to >>> >> collaborate with anyone that is interested. >>> >> >>> >> rb >>> >> >>> >> -- >>> >> Ryan Blue >>> >> Software Engineer >>> >> Netflix >>> >> >>> > >>> > >>> >>> >>> -- >>> Ryan Blue >>> Software Engineer >>> Netflix >>> >> >> > > > -- > Ryan Blue > Software Engineer > Netflix > -- Ryan Blue Software Engineer Netflix
