[
https://issues.apache.org/jira/browse/PARQUET-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Le Dem updated PARQUET-183:
----------------------------------
As this is a change in the format, it should happen in a major release,
hopefully 2.0.
> Special case empty columns to store 0 pages and no column chunks in the footer
> ------------------------------------------------------------------------------
>
> Key: PARQUET-183
> URL: https://issues.apache.org/jira/browse/PARQUET-183
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Julien Le Dem
>
> Currently, when a column is empty, each row group will contain one page that
> encodes the repetition and definition levels for this row group. These levels
> are all 0s, one per row in the row group (the row count is stored in the row
> group metadata). They are encoded using RLE, so the page ends up being very
> small. However, in cases where there are a lot of columns in a very sparse
> dataset, we end up with a lot of empty column chunks (a column chunk is the
> data for a given column in a given row group). The metadata could become much
> smaller by omitting empty column chunks, as the metadata of an empty column
> chunk can be derived from the row count in the corresponding row group.
> I propose the following:
> When a column chunk is empty, do not write any page to it.
> Do not add the column chunk metadata to the footer for such empty columns.
> A column chunk is empty if, when writing the row group to disk, there is only
> one page and this page contains repetition levels (rl) and definition levels
> (dl) that are only 0s (a completely empty column).
> When reading the dataset:
> - the column is present in the schema.
> - if there's no column chunk in the footer for a given row group, that means
> we can just replace rls and dls with infinite streams of 0s.
> - for predicate push down, any stats information can be replaced by a null
> count equal to the number of rows in the row group.
> This will help in cases where we have huge schemas in which only a small
> subset of columns is actually populated. The file data will now look as if we
> had declared a schema containing only the columns that actually have data in
> them. Only the schema in the footer will mention those empty columns.
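
To make the proposed write and read rules concrete, here is a minimal sketch in Python. It is not parquet-mr code: `Page`, `RowGroup`, `is_empty_chunk`, `write_row_group`, `read_levels` and `null_count` are hypothetical names used only for illustration, levels are held as plain lists rather than RLE-encoded bytes, and the column is assumed to be optional (maximum definition level greater than 0) so that a definition level of 0 really means null.

```python
from dataclasses import dataclass, field
from itertools import repeat
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Page:
    # Hypothetical in-memory page holding already-decoded levels.
    repetition_levels: List[int]
    definition_levels: List[int]


@dataclass
class RowGroup:
    # Hypothetical footer entry: the row count plus the column chunks
    # that were actually written (empty chunks are simply absent).
    num_rows: int
    chunks: Dict[str, List[Page]] = field(default_factory=dict)


def is_empty_chunk(pages: List[Page]) -> bool:
    """Rule from the proposal: exactly one page whose rl and dl are all 0."""
    if len(pages) != 1:
        return False
    page = pages[0]
    return not any(page.repetition_levels) and not any(page.definition_levels)


def write_row_group(num_rows: int, columns: Dict[str, List[Page]]) -> RowGroup:
    """Writer side: keep only the chunks that are not empty."""
    kept = {name: pages for name, pages in columns.items()
            if not is_empty_chunk(pages)}
    return RowGroup(num_rows=num_rows, chunks=kept)


def read_levels(row_group: RowGroup, column: str) -> Iterator[Tuple[int, int]]:
    """Reader side: yield (rl, dl) pairs, synthesizing 0s for a missing chunk."""
    pages = row_group.chunks.get(column)
    if pages is None:
        # The proposal speaks of infinite streams of 0s; this sketch bounds
        # the stream by the row count taken from the row group metadata.
        return repeat((0, 0), row_group.num_rows)
    return ((rl, dl)
            for page in pages
            for rl, dl in zip(page.repetition_levels, page.definition_levels))


def null_count(row_group: RowGroup, column: str) -> Optional[int]:
    """Stats for predicate push down: a missing chunk is all nulls."""
    if column not in row_group.chunks:
        return row_group.num_rows
    return None  # real statistics would come from the chunk metadata
```

With a row group of 3 rows where column `a` is entirely null and column `b` has two values, `write_row_group` drops `a` from the footer entry, `read_levels` still yields three `(0, 0)` pairs for it, and `null_count` reports 3.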