[
https://issues.apache.org/jira/browse/PARQUET-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727282#comment-15727282
]
Li commented on PARQUET-792:
----------------------------
Thanks for your reply. I turned on the log in
ColumnChunkPageWriteStore.writeToFileWriter and noticed that one field which is
null for all records still took tens of KB. There are thousands of such fields,
so tens of MB are wasted in one file. Digging deeper, I added a log in
ColumnWriterV1.writePage and found that dataColumn.getBufferedSize() equals 0,
while repetitionLevelColumn.getBufferedSize() and
definitionLevelColumn.getBufferedSize() are both tens of KB. Then I printed
the r level and d level in ColumnWriterV1.writeNull and noticed that both vary
from 0 to 2, and that there are tens of thousands of r/d levels per column. I
think the encoding must preserve the sequence of the actual level values
(0, 1, 2), so it cannot be compressed down to only 2-3 bytes.
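To make the comparison concrete, the logging I added boils down to printing the
buffered size of each of the three writers before the page is flushed. The
helper below is only a sketch of that (PageSizeLogging and logBufferedSizes are
names I made up; the three ValuesWriter arguments stand for the
repetitionLevelColumn, definitionLevelColumn and dataColumn fields of
ColumnWriterV1):

    import org.apache.parquet.column.values.ValuesWriter;

    // Sketch of the diagnostic print added inside ColumnWriterV1.writePage.
    final class PageSizeLogging {
      static void logBufferedSizes(ValuesWriter repetitionLevelColumn,
                                   ValuesWriter definitionLevelColumn,
                                   ValuesWriter dataColumn) {
        // For an all-null column the data writer buffers 0 bytes, while the
        // r/d level writers still buffer tens of KB.
        System.out.printf("r-level: %d bytes, d-level: %d bytes, data: %d bytes%n",
            repetitionLevelColumn.getBufferedSize(),
            definitionLevelColumn.getBufferedSize(),
            dataColumn.getBufferedSize());
      }
    }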
I did further investigation. In ColumnWriterV1.writePage, if
statistics.hasNonNullValue() returns false (i.e. the page contains only nulls),
I skip adding compressedBytes to buf, and the final file size shrinks to 1/10
of the original.
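Roughly, the experiment amounts to guarding the append with the page
statistics. The helper below is only an illustration of that change, not the
real parquet-mr code (maybeAppend is a name I made up, and the real writePage
signature differs):

    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.parquet.bytes.BytesInput;
    import org.apache.parquet.column.statistics.Statistics;

    // Illustration of the experiment: append the page bytes only when the page
    // statistics report at least one non-null value.
    final class AllNullPageExperiment {
      static void maybeAppend(BytesInput compressedBytes, Statistics statistics,
          OutputStream buf) throws IOException {
        if (statistics.hasNonNullValue()) {
          compressedBytes.writeAllTo(buf);
        }
        // else: all-null page, the r/d level bytes are simply dropped (this is
        // just an experiment to measure the size, not a correct fix).
      }
    }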
I will use the new parquet-cli tool to inspect the sizes. Thanks for your
advice.
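If I read the tool's help correctly, something like the following should show
the per-column and per-page sizes (the file name is just a placeholder, and I
am assuming the meta and pages subcommands):

    parquet meta sparse.parquet
    parquet pages sparse.parquet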
> Skip the storage of repetition level and definition level for all-null column
> -----------------------------------------------------------------------------
>
> Key: PARQUET-792
> URL: https://issues.apache.org/jira/browse/PARQUET-792
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Li
> Priority: Minor
>
> I have a very sparse protobuf message in my project, with thousands of fields.
> In practice, most of the fields are entirely null within one page.
> But the repetition levels and definition levels take a lot of storage space.
> Can Parquet skip storing the r levels and d levels for such all-null columns
> to save storage space?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)