Since you’re dealing with change (atomic updates and so forth) are we still 
just talking about a *format* or does the architecture also now include pieces 
that could be described as *servers* and *protocols*?

When I first spoke to the CarbonData folks I observed that they were 
maintaining a global index and therefore were something more than just a 
format. Are you taking parquet into the same territory? And if not, what 
assumptions did you make to keep things simple?

I hope I’m not being too negative. If you’re solving distributed metastore 
problems that would be huge. But it sounds a bit too good to be true. 

Julian

On 2017-12-06 16:20, Ryan Blue <[email protected]> wrote: 
> Hi everyone,
> 
> I mentioned in the sync-up this morning that I’d send out an introduction
> to the table format we’re working on, which we’re calling Iceberg.
> 
> For anyone that wasn’t around here’s the background: there are several
> problems with how we currently manage data files to make up a table in the
> Hadoop ecosystem. The one that came up today was that you can’t actually
> update a table atomically to, for example, rewrite a file and safely delete
> records. That’s because Hive tables track what files are currently visible
> by listing partition directories, and we don’t have (or want) transactions
> for changes in Hadoop file systems. This means that you can’t actually have
> isolated commits to a table and the result is that *query results from Hive
> tables can be wrong*, though rarely in practice.
> 
> The problems with current tables are caused primarily by keeping state
> about what files are in or not in a table in the file system. As I said,
> one problem is that there are no transactions but you also have to list
> directories to plan jobs (bad on S3) and rename files from a temporary
> location to a final location (really, really bad on S3).
> 
> To avoid these problems we’ve been building the Iceberg format that tracks
> tracks every file in a table instead of tracking directories. Iceberg
> maintains snapshots of all the files in a dataset and atomically swaps
> snapshots and other metadata to commit. There are a few benefits to doing
> it this way:
> 
>    - *Snapshot isolation*: Readers always use a consistent snapshot of the
>    table, without needing to hold a lock. All updates are atomic.
>    - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to
>    plan a job, reading a snapshot requires O(1) RPC calls
>    - *Distributed planning*: File pruning and predicate push-down is
>    distributed to jobs, removing the metastore bottleneck
>    - *Version history and rollback*: Table snapshots are kept around and it
>    is possible to roll back if a job has a bug and commits
>    - *Finer granularity partitioning*: Distributed planning and O(1) RPC
>    calls remove the current barriers to finer-grained partitioning
> 
> We’re also taking this opportunity to fix a few other problems:
> 
>    - Schema evolution: columns are tracked by ID to support add/drop/rename
>    - Types: a core set of types, thoroughly tested to work consistently
>    across all of the supported data formats
>    - Metrics: cost-based optimization metrics are kept in the snapshots
>    - Portable spec: tables should not be tied to Java and should have a
>    simple and clear specification for other implementers
> 
> We have the core library to track files done, along with most of a
> specification, and a Spark datasource (v2) that can read Iceberg tables.
> I’ll be working on the write path next and we plan to build a Presto
> implementation soon.
> 
> I think this should be useful to others and it would be great to
> collaborate with anyone that is interested.
> 
> rb
> ​
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 

Reply via email to