[jira] [Commented] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

Gang Wu (Jira) Tue, 03 Jan 2023 18:47:07 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654256#comment-17654256
 ]


Gang Wu commented on PARQUET-2221:
----------------------------------

IMHO, the spec is authoritative for reader implementations, so that they can 
correctly read Parquet files created by different writers. But it is the writer 
implementer's choice to fall back to any standard encoding. This is what video 
coding standards do (e.g. H.264/AVC and H.265/HEVC).
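
To make the reader side concrete, here is a minimal sketch, assuming a 
simplified page model: the reader decodes each data page according to the 
encoding that page declares, so it does not matter which fallback the writer 
chose. `DataPage` and `read_column_chunk` are hypothetical names for 
illustration, not the parquet-format Thrift structures or any library's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


# Hypothetical, simplified page model (an assumption for this sketch).
@dataclass
class DataPage:
    encoding: str      # "RLE_DICTIONARY" or "PLAIN" in this sketch
    payload: Sequence  # dictionary indices, or the values themselves


def read_column_chunk(dictionary: Optional[List], pages: List[DataPage]) -> List:
    """Decode every page according to its own declared encoding."""
    values = []
    for page in pages:
        if page.encoding == "RLE_DICTIONARY":
            if dictionary is None:
                raise ValueError("dictionary-encoded page without a dictionary page")
            # Look each index up in the chunk's dictionary.
            values.extend(dictionary[i] for i in page.payload)
        elif page.encoding == "PLAIN":
            # Values are stored directly; no dictionary lookup needed.
            values.extend(page.payload)
        else:
            raise ValueError(f"unsupported encoding: {page.encoding}")
    return values


# A chunk whose writer fell back to PLAIN after the first page.
chunk = [DataPage("RLE_DICTIONARY", [0, 1, 0]), DataPage("PLAIN", ["c", "d"])]
decoded = read_column_chunk(["a", "b"], chunk)
```

The reader only trusts the per-page encoding, which is why the writer is free 
to pick any standard encoding as its fallback.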

What's more, if fallback happens, the writer implementation can even re-encode 
the dictionary page and the dictionary-encoded data pages with the fallback 
encoding and discard the dictionary-encoded pages, just like Apache ORC does. 
Mixing dictionary encoding and non-dictionary encoding in the same column chunk 
makes features like reading the dictionary and predicate pushdown much more 
complicated to implement.
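
A rough sketch of that rewrite-on-fallback idea, under stated assumptions: the 
writer buffers the whole column chunk, emits a single page, and falls back when 
the number of distinct values exceeds a limit. `encode_chunk` and 
`max_dict_size` are hypothetical names, and this is neither the parquet-mr nor 
the ORC implementation.

```python
from typing import Dict, List, Tuple

Page = Tuple[str, list]  # (encoding, payload); simplified for this sketch


def encode_chunk(values: List[str], max_dict_size: int) -> Tuple[List[str], List[Page]]:
    """Return (dictionary, pages) for one column chunk.

    Assumed policy: if the dictionary would exceed max_dict_size distinct
    values, discard the dictionary-encoded data and rewrite the whole chunk
    as PLAIN, so the chunk never mixes the two encodings.
    """
    dictionary: Dict[str, int] = {}
    indices: List[int] = []
    for v in values:
        if v not in dictionary:
            if len(dictionary) == max_dict_size:
                # Fallback: drop the dictionary and re-encode everything as PLAIN.
                return [], [("PLAIN", list(values))]
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    # dict preserves insertion order, so keys are listed in index order.
    return list(dictionary), [("RLE_DICTIONARY", indices)]


small = encode_chunk(["a", "b", "a"], max_dict_size=2)
large = encode_chunk(["a", "b", "c"], max_dict_size=2)
```

With this policy a column chunk is either fully dictionary-encoded or fully 
plain, which keeps dictionary reads and dictionary-based predicate pushdown 
simple, at the cost of buffering and re-encoding on the writer side.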

cc [[email protected]]

> [Format] Encoding spec incorrect for dictionary fallback
> --------------------------------------------------------
>
>                 Key: PARQUET-2221
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2221
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Priority: Critical
>             Fix For: format-2.10.0
>
>
> The spec for DICTIONARY_ENCODING states that:
> bq. If the dictionary grows too big, whether in size or number of distinct 
> values, the encoding will fall back to the plain encoding. 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
> However, the parquet-mr implementation was deliberately changed to a 
> different fallback mechanism in 
> https://issues.apache.org/jira/browse/PARQUET-52
> I'm assuming the parquet-mr implementation is authoritative here. But then 
> the spec is incorrect and should be fixed to reflect expected behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
