[ 
https://issues.apache.org/jira/browse/PARQUET-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-183:
----------------------------------

As this is a change to the file format, it should happen in a major release, 
hopefully 2.0.

> Special case empty columns to store 0 pages and no column chunks in the footer
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-183
>                 URL: https://issues.apache.org/jira/browse/PARQUET-183
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Julien Le Dem
>
> Currently, when a column is empty, each row group contains one page that 
> encodes the repetition and definition levels for that row group. There will 
> be as many 0s as there are rows in the row group (the row count is stored in 
> the row group metadata). These values are RLE-encoded, so the page ends up 
> being very small. However, when a very sparse dataset has a lot of columns, we 
> end up with a lot of empty column chunks (a column chunk is the data for a 
> given column in a given row group). The metadata could become much smaller by 
> omitting empty column chunks, since the metadata of an empty column chunk can 
> be derived from the row count in the corresponding row group.
> I propose the following:
> When writing the dataset:
>  - when a column chunk is empty, do not write any page to it.
>  - do not add the column chunk metadata to the footer for such empty columns.
>  - a column chunk is empty if, when writing the row group to disk, it has only 
> one page and that page's repetition levels (rl) and definition levels (dl) 
> are all 0s (a completely empty column).
> When reading the dataset:
>  - the column is present in the schema.
>  - if there's no column chunk in the footer for a given row group that means 
> we can just replace rls and dls with infinite streams of 0s.
>  - for predicate push down, any stats information can be replaced by a null 
> count equal to the number of rows in the row group.
> This will help in cases where we have huge schemas in which only a small 
> subset of columns is populated. The file data will now look as if we had 
> declared the schema only for columns that actually have data in them. 
> Only the schema in the footer will mention those empty columns.
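
The write and read rules proposed above can be sketched roughly as follows. 
This is only an illustration in Python, not parquet-mr code: Page, RowGroup, 
write_row_group, read_levels and null_count are hypothetical names, levels 
are assumed to be already decoded into plain lists, and the column is assumed 
to be optional (max definition level > 0), so that a definition level of 0 
means null.

```python
from dataclasses import dataclass
from itertools import chain, repeat
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Page:
    # Hypothetical in-memory page: levels already decoded from RLE.
    repetition_levels: List[int]
    definition_levels: List[int]


@dataclass
class RowGroup:
    # Hypothetical footer entry: the row count, plus metadata only for
    # the column chunks that were actually written.
    num_rows: int
    chunks: Dict[str, List[Page]]


def is_empty_chunk(pages: List[Page]) -> bool:
    """Empty as defined in the proposal: one page whose rl and dl are all 0."""
    if len(pages) != 1:
        return False
    page = pages[0]
    return not any(page.repetition_levels) and not any(page.definition_levels)


def write_row_group(num_rows: int, columns: Dict[str, List[Page]]) -> RowGroup:
    """Drop empty chunks: no pages written, no chunk metadata in the footer."""
    kept = {name: pages for name, pages in columns.items()
            if not is_empty_chunk(pages)}
    return RowGroup(num_rows=num_rows, chunks=kept)


def read_levels(row_group: RowGroup,
                column: str) -> Tuple[Iterator[int], Iterator[int]]:
    """Return (rl, dl) streams; a missing chunk yields infinite streams of 0s."""
    pages = row_group.chunks.get(column)
    if pages is None:
        return repeat(0), repeat(0)
    rl = chain.from_iterable(p.repetition_levels for p in pages)
    dl = chain.from_iterable(p.definition_levels for p in pages)
    return rl, dl


def null_count(row_group: RowGroup, column: str) -> Optional[int]:
    """For a missing chunk every row is null; otherwise defer to real stats."""
    if column not in row_group.chunks:
        return row_group.num_rows
    return None  # real statistics would be consulted here
```

Because the synthesized level streams are infinite, a reader in this sketch 
has to stop after row_group.num_rows entries rather than wait for the stream 
to end.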



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)