Hi everyone,
Here are the notes from Tuesday’s sync. Sorry they’re a bit late!
*Attendees and topics*:
- Zoltan Ivanfi (Cloudera) - New release candidate
- Anna Szonyi (Cloudera) - New release candidate
- Gabor Szadovszky (Cloudera) - New release candidate
- Gidon Gershinsky (IBM) - Please vote on the encryption spec!
- Steven Moy (Yelp)
- Deepak Majeti (Vertica)
- Ryan Blue (Netflix) - Iceberg features useful for Parquet (if time)
*Discussion*:
New RC for 1.11.0
- Please vote!
- Zoltan sent a summary of the tests used to validate page skipping to
the dev list, but will also go over them in this sync
- ColumnIndexBuilder
- Read and write path
- Extensive unit tests, assert that min, max, nulls are correct
- In-memory unit tests
- ColumnIndexFilter
- Returns row ranges, read path
- Low-level tests with in-memory data verifying that filtering based on
the new indexes is correct
- Writing and reading actual files - integration tests
- Filters and asserts correctness
- Random haystack, known needles
- Randomly generated data and real values - uses a name column
- Sparse enough that pages should be skipped
- Validation of the index contract - values in min/max
- Not polished or committed but passing
- Hive, Impala, Spark tests
- Found an issue with a newly working Spark test, but it is not a
Parquet problem
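The "random haystack, known needles" strategy above can be sketched in plain Java, independent of the actual Parquet test code. This is an illustrative sketch only (class and method names are hypothetical): pages of random values with a few planted needles, a per-page min/max index standing in for the column index, and a filter that must skip sparse pages without ever missing a needle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch (not the actual Parquet test code) of the
// "random haystack, known needles" strategy: pages of random values with
// a few planted needles, a per-page min/max index, and a filter that must
// skip sparse pages without ever missing a needle.
public class HaystackNeedles {
  // Returns {pagesScanned, needlesFound, needlesPlanted}.
  static int[] run() {
    Random rnd = new Random(42);
    int pageSize = 100, pageCount = 50;
    long needle = 1_000_000L; // outside the haystack range, so it is rare

    List<long[]> pages = new ArrayList<>();
    int planted = 0;
    for (int p = 0; p < pageCount; p++) {
      long[] page = new long[pageSize];
      for (int i = 0; i < pageSize; i++) {
        page[i] = rnd.nextInt(10_000); // haystack values
      }
      if (p % 10 == 0) { // plant one needle in every 10th page
        page[rnd.nextInt(pageSize)] = needle;
        planted++;
      }
      pages.add(page);
    }

    // Per-page min/max "column index"; the index contract requires every
    // value in a page to fall within its recorded [min, max].
    int scanned = 0, found = 0;
    for (long[] page : pages) {
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (long v : page) { min = Math.min(min, v); max = Math.max(max, v); }
      if (needle < min || needle > max) {
        continue; // page skipped by the index
      }
      scanned++;
      for (long v : page) {
        if (v == needle) found++;
      }
    }
    return new int[] {scanned, found, planted};
  }

  public static void main(String[] args) {
    int[] r = run();
    if (r[1] != r[2]) throw new AssertionError("missed a needle");
    // prints: scanned 5 of 50 pages, found 5 needles
    System.out.println("scanned " + r[0] + " of 50 pages, found "
        + r[1] + " needles");
  }
}
```

The assertion captures both halves of the test: correctness (every needle is found) and effectiveness (the data is sparse enough that most pages are skipped).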
Iceberg features useful to Parquet
- Iceberg expression library
<https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/expressions/Expressions.java>
- Supports simple expression construction: equals("id", 34)
- Binds
<https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/expressions/Binder.java>
expressions to per-file schema and coerces types (34 -> 34L for INT64
columns)
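The binding step can be illustrated with a minimal sketch. The classes below are hypothetical, not the Iceberg API: an unbound predicate holds a column name and a raw literal, and binding resolves the name against a per-file schema and coerces the literal to the column's type (34 -> 34L for INT64).

```java
import java.util.Map;

// Hypothetical sketch of expression binding; these classes are
// illustrative only, not the Iceberg Expressions/Binder API.
public class BindSketch {
  enum Type { INT32, INT64 }

  record Unbound(String column, Object literal) {}
  record Bound(String column, Type type, Object literal) {}

  static Bound bind(Unbound expr, Map<String, Type> fileSchema) {
    Type type = fileSchema.get(expr.column());
    if (type == null) {
      throw new IllegalArgumentException("No such column: " + expr.column());
    }
    Object lit = expr.literal();
    if (type == Type.INT64 && lit instanceof Integer i) {
      lit = i.longValue(); // coerce the int literal to the column's type
    }
    return new Bound(expr.column(), type, lit);
  }

  public static void main(String[] args) {
    // Bind equals("id", 34) against a file whose "id" column is INT64.
    Bound b = bind(new Unbound("id", 34), Map.of("id", Type.INT64));
    System.out.println(b.literal() + " : "
        + b.literal().getClass().getSimpleName()); // prints: 34 : Long
  }
}
```

Binding per file is what makes schema evolution safe: the same unbound expression can be reused across files whose physical types differ.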
- Simpler implementations of row group stats and dictionary filters
- Only dependency requirement is iceberg-api
- Iceberg schemas and generic records
<https://github.com/apache/incubator-iceberg/tree/master/data/src/main/java/com/netflix/iceberg/data>
- Support ID-based column resolution for full schema evolution
- Built-in support for high-level types
- Simpler generic records using JVM-native primitive objects, List,
and Map
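As a small illustration of that point (this is a sketch of the idea, not the actual Iceberg data classes), a generic record built from JVM-native objects needs nothing beyond the standard library:

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a record represented with JVM-native boxed
// primitives, List, and Map instead of engine-specific row classes.
public class GenericRecordSketch {
  public static void main(String[] args) {
    Map<String, Object> record = Map.of(
        "id", 34L,                  // primitives as boxed JVM types
        "tags", List.of("a", "b"),  // repeated fields as List
        "props", Map.of("k", "v")); // map fields as Map
    System.out.println(record.get("id")); // prints: 34
  }
}
```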
- Iceberg record construction
<https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetReader.java>
- 5% faster when reading Avro, 20% faster when writing Avro
- Simpler to plug in new data models (Pig, Avro, Iceberg generics,
Spark currently supported)
- Iceberg file API
<https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/Parquet.java>
- No Hadoop dependency, could be used as 2.0 file API
- Supports row group filtering with Iceberg expressions
--
Ryan Blue
Software Engineer
Netflix