[
https://issues.apache.org/jira/browse/PARQUET-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727282#comment-15727282
]
Li commented on PARQUET-792:
----------------------------
Thanks for your reply. I turned on the log in
ColumnChunkPageWriteStore.writeToFileWriter and noticed that one field which is
null for all records still took tens of KB. There are thousands of such fields,
so tens of MB are wasted in one file. Digging deeper, I added a log in
ColumnWriterV1.writePage and found that dataColumn.getBufferedSize() equals 0,
while repetitionLevelColumn.getBufferedSize() and
definitionLevelColumn.getBufferedSize() are both tens of KB. Then I printed
the r level and d level in ColumnWriterV1.writeNull and noticed that both vary
from 0 to 2, and that there are tens of thousands of r/d levels per column. I
think the encoding must preserve the sequence of the actual level values
(0, 1, 2), so it cannot be compressed down to only 2-3 bytes.
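To make the comparison concrete, the logging I added boils down to printing the
buffered size of each of the three writers before the page is flushed. The
helper below is only a sketch of that (PageSizeLogging and logBufferedSizes are
names I made up; the three ValuesWriter arguments stand for the
repetitionLevelColumn, definitionLevelColumn and dataColumn fields of
ColumnWriterV1):

    import org.apache.parquet.column.values.ValuesWriter;

    // Sketch of the diagnostic print added inside ColumnWriterV1.writePage.
    final class PageSizeLogging {
      static void logBufferedSizes(ValuesWriter repetitionLevelColumn,
                                   ValuesWriter definitionLevelColumn,
                                   ValuesWriter dataColumn) {
        // For an all-null column the data writer buffers 0 bytes, while the
        // r/d level writers still buffer tens of KB.
        System.out.printf("r-level: %d bytes, d-level: %d bytes, data: %d bytes%n",
            repetitionLevelColumn.getBufferedSize(),
            definitionLevelColumn.getBufferedSize(),
            dataColumn.getBufferedSize());
      }
    }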
I did further investigation. In ColumnWriterV1.writePage, if
statistics.hasNonNullValue() returns false (i.e. the page contains only nulls),
I skip adding compressedBytes to buf, and the final file size shrinks to 1/10
of the original.
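Roughly, the experiment amounts to guarding the append with the page
statistics. The helper below is only an illustration of that change, not the
real parquet-mr code (maybeAppend is a name I made up, and the real writePage
signature differs):

    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.parquet.bytes.BytesInput;
    import org.apache.parquet.column.statistics.Statistics;

    // Illustration of the experiment: append the page bytes only when the page
    // statistics report at least one non-null value.
    final class AllNullPageExperiment {
      static void maybeAppend(BytesInput compressedBytes, Statistics statistics,
          OutputStream buf) throws IOException {
        if (statistics.hasNonNullValue()) {
          compressedBytes.writeAllTo(buf);
        }
        // else: all-null page, the r/d level bytes are simply dropped (this is
        // just an experiment to measure the size, not a correct fix).
      }
    }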
I will use the new parquet-cli tool to inspect the sizes. Thanks for your
advice.
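If I read the tool's help correctly, something like the following should show
the per-column and per-page sizes (the file name is just a placeholder, and I
am assuming the meta and pages subcommands):

    parquet meta sparse.parquet
    parquet pages sparse.parquet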
> Skip the storage of repetition level and definition level for all-null column
> -----------------------------------------------------------------------------
>
> Key: PARQUET-792
> URL: https://issues.apache.org/jira/browse/PARQUET-792
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Li
> Priority: Minor
>
> I have a very sparse protobuf message in my project, with thousands of fields.
> In practice, most of the fields are entirely null within one page.
> But the repetition levels and definition levels take a lot of storage space.
> Can Parquet skip storing the r levels and d levels for such all-null columns
> to save storage space?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)