Attendees/Agenda
Julien (Dremio):
- parquet-format: Arrow type parity.
- parquet-mr: Parquet-Arrow schema converter PR
Ryan (Netflix):
- present the new parquet-cli
- Parquet sort order proposal
Gabor, Zoltan (Cloudera, file formats team):
- getting started
Uwe (Blue Yonder):
- parquet-cpp getting close to release
- discussion of type changes with Arrow
Parquet logical types:
- Julien proposed new logical types to bring parity with Arrow: Union,
Interval types, Null, half-precision floats
- TODO(Julien): add LogicalType doc for new types.
- Union:
- differentiate between a null union and a value from another,
non-projected branch, using the optional fields of the union itself.
- describe union type constraints.
- Null: type for things that are always null. For example, data coming from
schema discovery on JSON with a field that is always null.
- Interval Type:
- uses actual SQL spec for interval units
- deprecate existing Interval logical type.
- Half precision float: punt on that for now.
- defined in Arrow metadata
- actually not implemented in arrow-cpp and arrow-java
- possibly add physical type for half precision types.
- add encodings? See Ryan’s PR for float encoding
- Uwe: TIMESTAMP_NANOS ?
- used in Pandas
- used in Hive (through Parquet’s loosely defined int96)
- debate whether we should support it or not.
- Possibly have an int64 or fixed length byte array to store it.
- TODO(Uwe): open a JIRA; Ryan to comment.
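As a rough illustration of the two storage options discussed for
TIMESTAMP_NANOS above, the Python sketch below packs a nanosecond timestamp
either as a single int64 or as a 12-byte int96-style value. The int96 layout
used here (8 bytes of nanoseconds within the day followed by a 4-byte Julian
day number, little-endian) is an assumption based on the convention commonly
attributed to Hive and Impala; it is not formally specified in Parquet. All
function names are hypothetical.

```python
import struct

# Assumption: Julian day number of 1970-01-01, as used by the int96 convention.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def nanos_to_int64(epoch_nanos: int) -> bytes:
    """Option 1: store nanoseconds since the Unix epoch as one signed int64."""
    return struct.pack("<q", epoch_nanos)


def int64_to_nanos(raw: bytes) -> int:
    return struct.unpack("<q", raw)[0]


def nanos_to_int96(epoch_nanos: int) -> bytes:
    """Option 2: 12 bytes = nanoseconds of day (int64) + Julian day (int32)."""
    days, nanos_of_day = divmod(epoch_nanos, NANOS_PER_DAY)
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def int96_to_nanos(raw: bytes) -> int:
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


if __name__ == "__main__":
    ts = 1_477_587_540 * 10**9 + 123_456_789  # an arbitrary sample timestamp
    print(len(nanos_to_int64(ts)), len(nanos_to_int96(ts)))
    print(int96_to_nanos(nanos_to_int96(ts)) == ts)
```

A signed int64 of nanoseconds since the epoch covers roughly the years 1677
to 2262, which is the main trade-off against the wider range of the 12-byte
form.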
Parquet-cli:
- Ryan's new parquet-cli
- easier to try encodings.
- look at data.
- some code comes from the Kite project (Apache 2 license).
Parquet sort order:
- current proposal: have 2 separate min and max in the stats block
- Ryan: to create a Pull Request.
- how to formally specify sort order (comparator/collation)
- standard database collations? Look into Calcite?
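To illustrate why the sort order needs to be specified formally, the Python
sketch below computes min and max over the same UTF-8 values under two
byte-wise comparators, signed and unsigned. The sample values and function
names are hypothetical and are not taken from the proposal; the sketch shows
that min/max statistics mean something only relative to a stated comparator.

```python
def signed_key(value: bytes) -> list:
    """Sort key that compares bytes as signed 8-bit integers (-128..127)."""
    return [b - 256 if b > 127 else b for b in value]


def min_max_unsigned(values):
    # Python compares bytes objects lexicographically as unsigned integers.
    return min(values), max(values)


def min_max_signed(values):
    return min(values, key=signed_key), max(values, key=signed_key)


# Hypothetical column chunk; "é" encodes to bytes above 0x7F in UTF-8.
column = [s.encode("utf-8") for s in ["apple", "zebra", "éclair"]]

unsigned_min, unsigned_max = min_max_unsigned(column)
signed_min, signed_max = min_max_signed(column)

if __name__ == "__main__":
    print("unsigned:", unsigned_min.decode(), unsigned_max.decode())
    print("signed:  ", signed_min.decode(), signed_max.decode())
```

Under the unsigned comparator the non-ASCII value sorts last, while under
the signed comparator it sorts first, so a reader that assumes the wrong
ordering could skip data incorrectly when filtering on statistics.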
Parquet-cpp release?
- fix bugs.
- release JIRA.
Next sync-up in two weeks.
On Thu, Oct 27, 2016 at 9:59 AM, Julien Le Dem <[email protected]> wrote:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>
--
Julien