I'm working on getting the code out to our open source GitHub org, probably
early next week. I'll set up a mailing list for it as well.

rb

On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <[email protected]> wrote:

> Sounds super interesting. Would love to collaborate on this. Do you have a
> repo or mailing list where you are working on this?
>
>
>
> On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I mentioned in the sync-up this morning that I’d send out an introduction
>> to the table format we’re working on, which we’re calling Iceberg.
>>
>> For anyone that wasn’t around, here’s the background: there are several
>> problems with how we currently manage data files to make up a table in the
>> Hadoop ecosystem. The one that came up today was that you can’t actually
>> update a table atomically to, for example, rewrite a file and safely
>> delete records. That’s because Hive tables track what files are currently
>> visible by listing partition directories, and we don’t have (or want)
>> transactions for changes in Hadoop file systems. This means that you can’t
>> actually have isolated commits to a table, and the result is that *query
>> results from Hive tables can be wrong*, though rarely in practice.
>>
>> The problems with current tables are caused primarily by keeping state
>> about what files are in or not in a table in the file system. As I said,
>> one problem is that there are no transactions, but you also have to list
>> directories to plan jobs (bad on S3) and rename files from a temporary
>> location to a final location (really, really bad on S3).
>>
>> To avoid these problems, we’ve been building the Iceberg format that
>> tracks every file in a table instead of tracking directories. Iceberg
>> maintains snapshots of all the files in a dataset and atomically swaps
>> snapshots and other metadata to commit. There are a few benefits to doing
>> it this way:
>>
>>    - *Snapshot isolation*: Readers always use a consistent snapshot of the
>>    table, without needing to hold a lock. All updates are atomic.
>>    - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table
>>    to plan a job, reading a snapshot requires O(1) RPC calls.
>>    - *Distributed planning*: File pruning and predicate push-down are
>>    distributed to jobs, removing the metastore bottleneck.
>>    - *Version history and rollback*: Table snapshots are kept around, and
>>    it is possible to roll back if a job has a bug and commits.
>>    - *Finer granularity partitioning*: Distributed planning and O(1) RPC
>>    calls remove the current barriers to finer-grained partitioning.
>>
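
To make the snapshot idea above concrete, here is a rough in-memory sketch
of tracking files per snapshot and swapping the current snapshot to commit.
It is only an illustration, not the Iceberg library: the names Snapshot,
Table, commit, and rollback are made up for this example, and the lock
stands in for whatever atomic swap a real table implementation would use.

```python
import threading
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of every data file visible in one table version."""
    snapshot_id: int
    files: FrozenSet[str]


class Table:
    """Hypothetical table that tracks files, not directories."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for an atomic metadata swap
        self._history: List[Snapshot] = [Snapshot(0, frozenset())]
        self._current = self._history[0]

    def current(self) -> Snapshot:
        # Readers grab one reference and keep using it; no lock is held
        # while they read, and later commits never mutate this snapshot.
        return self._current

    def commit(self, base: Snapshot,
               added: Iterable[str] = (),
               removed: Iterable[str] = ()) -> Snapshot:
        files = (base.files - frozenset(removed)) | frozenset(added)
        with self._lock:
            # Only commit if nobody else committed since `base` was read.
            if self._current is not base:
                raise RuntimeError("concurrent commit; retry on new snapshot")
            new = Snapshot(len(self._history), files)
            self._history.append(new)
            self._current = new
        return new

    def rollback(self, snapshot_id: int) -> None:
        # Old snapshots are kept, so rolling back is just another swap.
        with self._lock:
            for snap in self._history:
                if snap.snapshot_id == snapshot_id:
                    self._current = snap
                    return
        raise KeyError(snapshot_id)
```

A reader that called current() before a rewrite keeps seeing the old file
list, a writer working from a stale snapshot fails instead of clobbering the
table, and rollback simply points the table back at an earlier snapshot.
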
>> We’re also taking this opportunity to fix a few other problems:
>>
>>    - Schema evolution: columns are tracked by ID to support
>>    add/drop/rename
>>    - Types: a core set of types, thoroughly tested to work consistently
>>    across all of the supported data formats
>>    - Metrics: cost-based optimization metrics are kept in the snapshots
>>    - Portable spec: tables should not be tied to Java and should have a
>>    simple and clear specification for other implementers
>>
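
As a small illustration of why tracking columns by ID helps with
add/drop/rename, here is a toy sketch. The Schema class and its methods are
hypothetical names for this example only, and it assumes stored rows are
keyed by column ID rather than by column name.

```python
from typing import Any, Dict


class Schema:
    """Hypothetical schema that identifies columns by ID, not by name."""

    def __init__(self) -> None:
        self._next_id = 1
        self._names: Dict[int, str] = {}  # column ID -> current name

    def _id_of(self, name: str) -> int:
        for column_id, current_name in self._names.items():
            if current_name == name:
                return column_id
        raise KeyError(name)

    def add(self, name: str) -> int:
        if name in self._names.values():
            raise ValueError(f"column already exists: {name}")
        column_id = self._next_id
        self._next_id += 1  # IDs are never reused, even after a drop
        self._names[column_id] = name
        return column_id

    def rename(self, old: str, new: str) -> None:
        # Only the name changes; data written under this ID is untouched.
        self._names[self._id_of(old)] = new

    def drop(self, name: str) -> None:
        del self._names[self._id_of(name)]

    def project(self, stored_row: Dict[int, Any]) -> Dict[str, Any]:
        # Resolve stored values by ID against the current schema. Columns
        # added after the row was written come back as None.
        return {name: stored_row.get(column_id)
                for column_id, name in self._names.items()}
```

Because a dropped column's ID is never handed out again, adding a new column
with the same name later does not resurrect the old column's data, and a
rename needs no rewrite of existing files.
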
>> We have the core library to track files done, along with most of a
>> specification, and a Spark datasource (v2) that can read Iceberg tables.
>> I’ll be working on the write path next and we plan to build a Presto
>> implementation soon.
>>
>> I think this should be useful to others and it would be great to
>> collaborate with anyone that is interested.
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix
