Hi everyone,
Here are the notes from Tuesday’s sync. Sorry they’re a bit late!
*Attendees and topics*:
- Zoltan Ivanfi (Cloudera) - New release candidate
- Anna Szonyi (Cloudera) - New release candidate
- Gabor Szadovszky (Cloudera) - New release candidate
- Gidon Gershinsky (IBM) - Please vote on the encryption spec!
- Steven Moy (Yelp)
- Deepak Majeti (Vertica)
- Ryan Blue (Netflix) - Iceberg features useful for Parquet (if time)
*Discussion*:
New RC for 1.11.0
- Please vote!
- Zoltan sent a summary of the tests used to validate page skipping to
the dev list, but will also go over them in this sync
- ColumnIndexBuilder
- Read and write path
- Extensive unit tests, assert that min, max, nulls are correct
- In-memory unit tests
- ColumnIndexFilter
- Returns row ranges, read path
- Low-level tests with in-memory data verifying that filtering based on
the new indexes is correct
- Writing and reading actual files - integration tests
- Filters and asserts correctness
- Random haystack, known needles
- Randomly generated data and real values - uses a name column
- Sparse enough that pages should be skipped
- Validation of the index contract - values in min/max
- Not polished or committed but passing
- Hive, Impala, Spark tests
- Found an issue with a newly working Spark test, but it is not a
Parquet problem
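The "random haystack, known needles" strategy above can be sketched in plain Java, independent of the actual Parquet test code. This is an illustrative sketch only (class and method names are hypothetical): pages of random values with a few planted needles, a per-page min/max index standing in for the column index, and a filter that must skip sparse pages without ever missing a needle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch (not the actual Parquet test code) of the
// "random haystack, known needles" strategy: pages of random values with
// a few planted needles, a per-page min/max index, and a filter that must
// skip sparse pages without ever missing a needle.
public class HaystackNeedles {
  // Returns {pagesScanned, needlesFound, needlesPlanted}.
  static int[] run() {
    Random rnd = new Random(42);
    int pageSize = 100, pageCount = 50;
    long needle = 1_000_000L; // outside the haystack range, so it is rare

    List<long[]> pages = new ArrayList<>();
    int planted = 0;
    for (int p = 0; p < pageCount; p++) {
      long[] page = new long[pageSize];
      for (int i = 0; i < pageSize; i++) {
        page[i] = rnd.nextInt(10_000); // haystack values
      }
      if (p % 10 == 0) { // plant one needle in every 10th page
        page[rnd.nextInt(pageSize)] = needle;
        planted++;
      }
      pages.add(page);
    }

    // Per-page min/max "column index"; the index contract requires every
    // value in a page to fall within its recorded [min, max].
    int scanned = 0, found = 0;
    for (long[] page : pages) {
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (long v : page) { min = Math.min(min, v); max = Math.max(max, v); }
      if (needle < min || needle > max) {
        continue; // page skipped by the index
      }
      scanned++;
      for (long v : page) {
        if (v == needle) found++;
      }
    }
    return new int[] {scanned, found, planted};
  }

  public static void main(String[] args) {
    int[] r = run();
    if (r[1] != r[2]) throw new AssertionError("missed a needle");
    // prints: scanned 5 of 50 pages, found 5 needles
    System.out.println("scanned " + r[0] + " of 50 pages, found "
        + r[1] + " needles");
  }
}
```

The assertion captures both halves of the test: correctness (every needle is found) and effectiveness (the data is sparse enough that most pages are skipped).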
Iceberg features useful to Parquet
- Iceberg expression library
<https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/expressions/Expressions.java>
- Supports simple expression construction: equals("id", 34)
- Binds
<https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/expressions/Binder.java>
expressions to per-file schema and coerces types (34 -> 34L for INT64
columns)
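The binding step can be illustrated with a minimal sketch. The classes below are hypothetical, not the Iceberg API: an unbound predicate holds a column name and a raw literal, and binding resolves the name against a per-file schema and coerces the literal to the column's type (34 -> 34L for INT64).

```java
import java.util.Map;

// Hypothetical sketch of expression binding; these classes are
// illustrative only, not the Iceberg Expressions/Binder API.
public class BindSketch {
  enum Type { INT32, INT64 }

  record Unbound(String column, Object literal) {}
  record Bound(String column, Type type, Object literal) {}

  static Bound bind(Unbound expr, Map<String, Type> fileSchema) {
    Type type = fileSchema.get(expr.column());
    if (type == null) {
      throw new IllegalArgumentException("No such column: " + expr.column());
    }
    Object lit = expr.literal();
    if (type == Type.INT64 && lit instanceof Integer i) {
      lit = i.longValue(); // coerce the int literal to the column's type
    }
    return new Bound(expr.column(), type, lit);
  }

  public static void main(String[] args) {
    // Bind equals("id", 34) against a file whose "id" column is INT64.
    Bound b = bind(new Unbound("id", 34), Map.of("id", Type.INT64));
    System.out.println(b.literal() + " : "
        + b.literal().getClass().getSimpleName()); // prints: 34 : Long
  }
}
```

Binding per file is what makes schema evolution safe: the same unbound expression can be reused across files whose physical types differ.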
- Simpler implementations of row group stats and dictionary filters
- Only dependency requirement is iceberg-api
- Iceberg schemas and generic records
<https://github.com/apache/incubator-iceberg/tree/master/data/src/main/java/com/netflix/iceberg/data>
- Support ID-based column resolution for full schema evolution
- Built-in support for high-level types
- Simpler generic records using JVM-native primitive objects, List,
and Map
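As a small illustration of that point (this is a sketch of the idea, not the actual Iceberg data classes), a generic record built from JVM-native objects needs nothing beyond the standard library:

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a record represented with JVM-native boxed
// primitives, List, and Map instead of engine-specific row classes.
public class GenericRecordSketch {
  public static void main(String[] args) {
    Map<String, Object> record = Map.of(
        "id", 34L,                  // primitives as boxed JVM types
        "tags", List.of("a", "b"),  // repeated fields as List
        "props", Map.of("k", "v")); // map fields as Map
    System.out.println(record.get("id")); // prints: 34
  }
}
```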
- Iceberg record construction
<https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetReader.java>
- 5% faster when reading Avro, 20% faster when writing Avro
- Simpler to plug in new data models (Pig, Avro, Iceberg generics,
Spark currently supported)
- Iceberg file API
<https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/Parquet.java>
- No Hadoop dependency, could be used as 2.0 file API
- Supports row group filtering with Iceberg expressions
--
Ryan Blue
Software Engineer
Netflix