Notes from the sync (Full room today!)

Zoltan (Cloudera, Parquet)
Cheng (Databricks, Parquet - Spark integration): Index discussion
Ryan (Netflix): Order changes, Logical type - Timestamp
Deepak (Vertica - Parquet): Timestamp, indexes
Greg (Cloudera): Timestamp
Lars (Cloudera, Impala): Min/Max #46, feedback on indices
Marcel (Cloudera, Impala): Min/Max #46, Index pages
QinHui (Criteo): Migration project from JSON to Parquet using Protobuf. Problem related to this.
Srinath (Databricks): Indexing
Julien (Dremio): Min/Max, Index discussion

Min/max: https://github.com/apache/parquet-format/pull/46
 - Discussed the forward-compatibility requirement that ColumnOrder act as the gatekeeper for interpreting the min_value and max_value fields.
 - Having the signed field is redundant and unnecessary.
 - Action: Ryan to update the PR for final review this week (everyone).

Index: https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#
 - 2 types of lookup structures:
   - SortColumnIndex: index of values on sorted columns (just boundary values, only for the main sorting column).
     - (The name should be changed, as it applies even if the column is not sorted.)
   - OffsetIndex: locate data pages by row number.
   SortColumnIndex is used to narrow down the pages to apply a filter on. OffsetIndex is used to find the selected rows in the other columns (projected but not filtered on).
 - Lars and Marcel to make sure the doc is linked in the JIRA and that the JIRA is referenced in the title.
 - Action for everyone: Provide feedback before April 19.
 - After that, create a PR in parquet-format (labelled experimental spec until a reference implementation is finalized).

Timestamp: https://github.com/apache/parquet-format/pull/51 <https://github.com/apache/parquet-format/pull/51/files>
 - PR #51 replaces the current LogicalType enum with a better, forward-compatible, union-based definition.
 - Action for everyone: Provide feedback before April 19.

Protobuf:
 - QinHui to propose a JIRA/PR for saving field ids in the schema for protobufs.
 - Capture unknown fields for which we only know the ID.

On Wed, Apr 12, 2017 at 9:57 AM, Julien Le Dem <[email protected]> wrote:

> Marcel and Lars' doc:
> https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#heading=h.ft5dh2chrcjb
>
> On Wed, Apr 12, 2017 at 9:51 AM, Julien Le Dem <[email protected]> wrote:
>
>> 10am PT today on google hangout:
>> https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up
>>
>> --
>> Julien
>>
>
> --
> Julien
>

--
Julien
