Julien Le Dem created PARQUET-183:
-------------------------------------

             Summary: Special case empty columns to store 0 pages and no column 
chunks in the footer
                 Key: PARQUET-183
                 URL: https://issues.apache.org/jira/browse/PARQUET-183
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Julien Le Dem


Currently, when a column is empty, each row group still contains one page for 
it that encodes the repetition and definition levels for that row group. These 
levels are as many 0s as there are rows in the row group (the row count is 
stored in the row group metadata). They are RLE-encoded, so the page ends up 
being very small. 
However, when a very sparse dataset has a lot of columns, we end up with a lot 
of empty column chunks (a column chunk is the data for a given column in a 
given row group). The footer metadata could become much smaller by omitting 
empty column chunks, since the metadata of an empty column chunk can be 
derived from the row count of the corresponding row group.
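
To make the scaling concrete, here is a rough sketch of how the number of 
column chunk metadata entries in the footer grows. The function name and the 
example numbers are hypothetical and chosen only for illustration; they are 
not measurements. The "proposed" figure also assumes that every populated 
column has data in every row group.

```python
def chunk_metadata_entries(num_row_groups: int, num_columns: int) -> int:
    """Footer entries when each listed column has a chunk in each row group."""
    return num_row_groups * num_columns


# Hypothetical sparse dataset: 10,000 columns in the schema, of which only
# 100 are populated, written as 50 row groups.
today = chunk_metadata_entries(num_row_groups=50, num_columns=10_000)
proposed = chunk_metadata_entries(num_row_groups=50, num_columns=100)

print(today, proposed)
```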

I propose the following:
 - When a column chunk is empty, do not write any page for it.
 - Do not add column chunk metadata to the footer for such empty column chunks.
 - A column chunk is considered empty if, when the row group is written to 
disk, it has only one page and that page's repetition levels (rl) and 
definition levels (dl) are all 0s (i.e. the column is completely empty in that 
row group).
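
The emptiness check described above could be sketched roughly as follows. This 
is not parquet-mr code: the Page class, the function names, and the dict-based 
row group are hypothetical stand-ins. It also assumes an optional column, 
since for a required top-level column the maximum definition level is 0 and 
all-zero levels would not mean "no data".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    # Hypothetical in-memory page holding only the decoded levels.
    repetition_levels: List[int]
    definition_levels: List[int]


def is_empty_column_chunk(pages: List[Page]) -> bool:
    """True if the chunk is a single page whose rl and dl are all 0."""
    if len(pages) != 1:
        return False
    page = pages[0]
    return (all(rl == 0 for rl in page.repetition_levels)
            and all(dl == 0 for dl in page.definition_levels))


def chunks_to_write(row_group: Dict[str, List[Page]]) -> Dict[str, List[Page]]:
    """Drop empty chunks so neither pages nor footer metadata are written."""
    return {column: pages for column, pages in row_group.items()
            if not is_empty_column_chunk(pages)}
```

For example, a row group with one all-null column "a" and one populated 
column "b" would keep only "b" in the result of chunks_to_write.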
When reading the dataset:
 - the column is still present in the schema.
 - if there is no column chunk in the footer for a given row group, we can 
simply replace its rls and dls with infinite streams of 0s.
 - for predicate push down, any stats information can be replaced by a null 
count equal to the number of rows in the row group.
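
The read side described above might look roughly like this. It is a minimal 
sketch, not the parquet-mr reader: the function names and the dict used for 
statistics are hypothetical. The last function illustrates one way the 
synthetic stats could be used, namely dropping a row group for an equality 
filter against a non-null value.

```python
from itertools import islice, repeat


def missing_chunk_levels():
    """(rl, dl) streams for a chunk absent from the footer: endless 0s."""
    return repeat(0), repeat(0)


def missing_chunk_stats(row_count: int) -> dict:
    """Synthetic statistics: every row of the row group is null."""
    return {"null_count": row_count, "min": None, "max": None}


def can_drop_for_non_null_eq(stats: dict, row_count: int) -> bool:
    """A filter `column == <non-null value>` matches nothing if all are null."""
    return stats["null_count"] == row_count


rl, dl = missing_chunk_levels()
first_levels = list(islice(dl, 4))  # the reader takes as many 0s as it needs
stats = missing_chunk_stats(row_count=1000)
```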

This will help in cases where we have huge schemas in which only a small 
subset of columns is actually populated. The file data will then look as if we 
had declared only the schema for columns that actually have data in them. Only 
the schema in the footer will mention those empty columns.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
