Notes:
Attendees/Agenda:
Zoltan (Cloudera, file formats):
- timestamp types
Ryan (Netflix):
- timestamp types
- fix for sorting metadata (min-max)
Deepak (Vertica, parquet-cpp):
- timestamp
Emily (IBM Spark Technology Center)
Greg (Cloudera):
- timestamp
Lars (Cloudera Impala):
- min-max (https://github.com/apache/parquet-format/pull/46)
Marcel (Cloudera Impala):
- timestamp
- sorting/min max
- bloom filters
Julien (Dremio):
- sorting/min max
- timestamp
- Timestamp (2 types):
- Floating Timestamp
- ambiguous with respect to the time zone: the stored data is year/month/day/microseconds.
- timezone-less
- same binary representation as the current Timestamp, with a different logical annotation.
- open question: how to store the metadata. Same binary format with or without a timezone.
- action: Ryan to propose a PR on parquet-format
- Timestamp with Timezone.
- stored in UTC
- client side conversion to UTC
- writer timezone should be stored in the metadata?
- need to clarify if time can be adjusted.
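To make the two timestamp flavours above concrete, here is a minimal Python sketch. The function names, the int64 microsecond encoding, and the UTC-8 writer timezone are assumptions made for illustration; nothing here was agreed in the meeting.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical writer timezone (UTC-8), chosen only for illustration.
WRITER_TZ = timezone(timedelta(hours=-8))
EPOCH_NAIVE = datetime(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(delta):
    """Convert a timedelta to a whole number of microseconds."""
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def encode_floating(wall_clock):
    """Timezone-less: store the wall-clock fields as written, no adjustment."""
    assert wall_clock.tzinfo is None
    return to_micros(wall_clock - EPOCH_NAIVE)


def encode_utc(local):
    """With timezone: the client converts to UTC before writing."""
    assert local.tzinfo is not None
    return to_micros(local.astimezone(timezone.utc) - EPOCH_UTC)


wall = datetime(2017, 3, 8, 10, 29, 0)
floating = encode_floating(wall)
instant = encode_utc(wall.replace(tzinfo=WRITER_TZ))

# Same binary type (int64), different meaning: the values differ by the
# writer's UTC offset, so the logical annotation must say which one it is.
print(instant - floating)
```

The stored integers have the same width and unit in both cases, which is why the distinction has to live in the logical annotation rather than in the physical type.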
- Int96: to be deprecated
- int64 used instead with logical type.
- won’t fix int96 ordering. Instead use replacement type.
- Lars to update the JIRA (PARQUET-323)
- new binary format: int64 storing the actual date (year, month, day) + microseconds since midnight.
- Marcel to open a JIRA.
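As background for the int96 deprecation, the sketch below converts a legacy int96 timestamp to int64 microseconds since the Unix epoch and shows why min-max ordering on the raw int96 bytes is unreliable. It assumes the layout commonly written by Impala and Hive (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte Julian day number); the function names are hypothetical.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000


def int96_to_micros(raw):
    """Decode a 12-byte int96 timestamp to microseconds since the Unix epoch."""
    if len(raw) != 12:
        raise ValueError("int96 timestamps are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-microsecond precision is dropped here.
    return days * MICROS_PER_DAY + nanos_of_day // 1_000


def micros_to_int96(micros):
    """Encode microseconds since the Unix epoch in the assumed int96 layout."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return struct.pack("<qi", micros_of_day * 1_000, days + JULIAN_DAY_OF_UNIX_EPOCH)


earlier = micros_to_int96(1_000)          # 1 ms after the epoch
later = micros_to_int96(MICROS_PER_DAY)   # exactly one day after the epoch

# Byte-wise comparison puts the later value first, because the
# nanoseconds-of-day field comes before the day field.
print(later < earlier)
# Comparing the decoded int64 values gives the chronological order.
print(int96_to_micros(later) > int96_to_micros(earlier))
```

With an int64 logical type, plain signed integer comparison already yields chronological order, so min-max statistics need no special handling.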
- Sorting:
- Ryan to update the PR (https://github.com/apache/parquet-format/pull/46)
- Bloom filter: (PARQUET-319, PARQUET-41)
- take analysis from original PR:
- https://github.com/apache/parquet-mr/pull/215
- https://github.com/apache/parquet-format/pull/28
- need to define metadata.
- C++ code reuse between parquet-cpp, Impala, …
- Impala team to discuss how they want to do that.
- store page level stats in footer (PARQUET-907)
- several options:
- Index Page: similar to an ISAM index. 1 per row group; if ordered, just maxes and offsets
- add optional field in footer metadata.
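
The "maxes and offsets" idea for an ordered column can be sketched as a binary search over the per-page maximum values. The index contents and the function name below are made up for illustration; the actual metadata layout was still to be defined.

```python
from bisect import bisect_left

# Hypothetical index for one row group of a sorted column:
# the max value of each page and the file offset where that page starts.
page_maxes = [99, 250, 410, 980]
page_offsets = [4, 8_196, 16_388, 24_580]


def find_page_offset(value):
    """Return the offset of the only page that can hold `value`, or None."""
    i = bisect_left(page_maxes, value)  # first page whose max >= value
    return page_offsets[i] if i < len(page_maxes) else None


print(find_page_offset(250))    # falls in the second page
print(find_page_offset(1_000))  # larger than every max: no page to read
```

Because the column is sorted, the maxes alone are enough to pick a single page; an unsorted column would need both min and max per page and could match several pages.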
On Wed, Mar 8, 2017 at 10:29 AM, Julien Le Dem <[email protected]> wrote:
> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>
> --
> Julien
>
--
Julien