Since youâre dealing with change (atomic updates and so forth) are we still just talking about a *format* or does the architecture also now include pieces that could be described as *servers* and *protocols*?
When I first spoke to the CarbonData folks I observed that they were maintaining a global index and therefore were something more than just a format. Are you taking parquet into the same territory? And if not, what assumptions did you make to keep things simple? I hope Iâm not being too negative. If youâre solving distributed metastore problems that would be huge. But it sounds a bit too good to be true. Julian On 2017-12-06 16:20, Ryan Blue <[email protected]> wrote: > Hi everyone, > > I mentioned in the sync-up this morning that Iâd send out an introduction > to the table format weâre working on, which weâre calling Iceberg. > > For anyone that wasnât around hereâs the background: there are several > problems with how we currently manage data files to make up a table in the > Hadoop ecosystem. The one that came up today was that you canât actually > update a table atomically to, for example, rewrite a file and safely delete > records. Thatâs because Hive tables track what files are currently visible > by listing partition directories, and we donât have (or want) transactions > for changes in Hadoop file systems. This means that you canât actually have > isolated commits to a table and the result is that *query results from Hive > tables can be wrong*, though rarely in practice. > > The problems with current tables are caused primarily by keeping state > about what files are in or not in a table in the file system. As I said, > one problem is that there are no transactions but you also have to list > directories to plan jobs (bad on S3) and rename files from a temporary > location to a final location (really, really bad on S3). > > To avoid these problems weâve been building the Iceberg format that tracks > tracks every file in a table instead of tracking directories. Iceberg > maintains snapshots of all the files in a dataset and atomically swaps > snapshots and other metadata to commit. There are a few benefits to doing > it this way: > > - *Snapshot isolation*: Readers always use a consistent snapshot of the > table, without needing to hold a lock. All updates are atomic. > - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to > plan a job, reading a snapshot requires O(1) RPC calls > - *Distributed planning*: File pruning and predicate push-down is > distributed to jobs, removing the metastore bottleneck > - *Version history and rollback*: Table snapshots are kept around and it > is possible to roll back if a job has a bug and commits > - *Finer granularity partitioning*: Distributed planning and O(1) RPC > calls remove the current barriers to finer-grained partitioning > > Weâre also taking this opportunity to fix a few other problems: > > - Schema evolution: columns are tracked by ID to support add/drop/rename > - Types: a core set of types, thoroughly tested to work consistently > across all of the supported data formats > - Metrics: cost-based optimization metrics are kept in the snapshots > - Portable spec: tables should not be tied to Java and should have a > simple and clear specification for other implementers > > We have the core library to track files done, along with most of a > specification, and a Spark datasource (v2) that can read Iceberg tables. > Iâll be working on the write path next and we plan to build a Presto > implementation soon. > > I think this should be useful to others and it would be great to > collaborate with anyone that is interested. > > rb > â > -- > Ryan Blue > Software Engineer > Netflix >
