[ https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654256#comment-17654256 ]
Gang Wu commented on PARQUET-2221:
----------------------------------
IMHO, the spec is authoritative for reader implementations, so that they can
correctly read Parquet files created by different writers. But it is the writer
implementer's choice whether to fall back to any standard encoding. This is the
approach video coding standards take (e.g. H.264/AVC and H.265/HEVC).
What's more, when fallback happens, the writer implementation can even rewrite
the dictionary page and the dictionary-encoded data pages in the fallback
encoding and discard the dictionary-encoded pages, just like what Apache ORC
does. Mixing dictionary encoding and non-dictionary encoding in the same column
chunk makes the implementation of features like dictionary reading and
predicate pushdown much more complicated.
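
To make the rewrite-on-fallback idea concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: ChunkWriter, Page,
max_dict_entries and page_size are hypothetical names, not the parquet-mr or
ORC API, and pages are modeled as plain lists rather than RLE/bit-packed bytes.
The point is that after fallback every page in the column chunk uses a single
encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    encoding: str         # "RLE_DICTIONARY" or "PLAIN"
    values: List[object]  # dictionary indices, or raw values once PLAIN


@dataclass
class ChunkWriter:
    """Hypothetical column chunk writer (illustration only)."""
    max_dict_entries: int   # assumed fallback threshold
    page_size: int = 2      # values per data page, kept tiny for the demo
    dictionary: List[str] = field(default_factory=list)
    index_of: Dict[str, int] = field(default_factory=dict)
    pages: List[Page] = field(default_factory=list)
    buffer: List[object] = field(default_factory=list)
    fallen_back: bool = False

    def write(self, value: str) -> None:
        if not self.fallen_back and value not in self.index_of:
            if len(self.dictionary) >= self.max_dict_entries:
                self._fall_back()  # dictionary grew too big
            else:
                self.index_of[value] = len(self.dictionary)
                self.dictionary.append(value)
        # Buffer the raw value after fallback, the dictionary index before.
        self.buffer.append(value if self.fallen_back else self.index_of[value])
        if len(self.buffer) >= self.page_size:
            self._flush()

    def _flush(self) -> None:
        if self.buffer:
            encoding = "PLAIN" if self.fallen_back else "RLE_DICTIONARY"
            self.pages.append(Page(encoding, self.buffer))
            self.buffer = []

    def _fall_back(self) -> None:
        # Re-encode already buffered pages as PLAIN, then drop the dictionary.
        for page in self.pages:
            page.values = [self.dictionary[i] for i in page.values]
            page.encoding = "PLAIN"
        self.buffer = [self.dictionary[i] for i in self.buffer]
        self.dictionary.clear()
        self.index_of.clear()
        self.fallen_back = True

    def finish(self) -> List[Page]:
        self._flush()
        return self.pages


def write_chunk(values: List[str], max_dict_entries: int) -> ChunkWriter:
    writer = ChunkWriter(max_dict_entries=max_dict_entries)
    for v in values:
        writer.write(v)
    writer.finish()
    return writer
```

With max_dict_entries=2, writing a, b, a, c, c triggers fallback at "c" and
leaves only PLAIN pages. A reader then never sees both encodings in one chunk,
at the cost of keeping the pages in memory until the chunk is finished.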
cc [[email protected]]
> [Format] Encoding spec incorrect for dictionary fallback
> --------------------------------------------------------
>
> Key: PARQUET-2221
> URL: https://issues.apache.org/jira/browse/PARQUET-2221
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Antoine Pitrou
> Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec for DICTIONARY_ENCODING states that:
> bq. If the dictionary grows too big, whether in size or number of distinct
> values, the encoding will fall back to the plain encoding.
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
> However, the parquet-mr implementation was deliberately changed to a
> different fallback mechanism in
> https://issues.apache.org/jira/browse/PARQUET-52
> I'm assuming the parquet-mr implementation is authoritative here. But then
> the spec is incorrect and should be fixed to reflect expected behavior.