William Butler created PARQUET-2124:
---------------------------------------
Summary: Bad DCHECK For Intermixed Dictionary Encoding
Key: PARQUET-2124
URL: https://issues.apache.org/jira/browse/PARQUET-2124
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: William Butler
Assignee: William Butler
Parquet CPP has a DCHECK for a dictionary encoded page coming after a
non-dictionary encoded page. This is bad because the DCHECK can be triggered by
Parquet files that have a column that has a dictionary page, then a
non-dictionary encoded page, then a page of dictionary encoded values (indices).
Fuzzing found such a file. While the DCHECK could be turned into an exception,
I don't see anything in the Parquet specification that prohibits such a
sequence of pages.
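
To illustrate the page sequence described above, here is a minimal sketch of a
decoder that accepts it without asserting on page order. Everything in it
(PageKind, Page, DecodeColumnChunk, and the int32-only payload) is a
hypothetical simplification made up for this example; it is not the parquet-cpp
reader and does not reflect its actual types or control flow.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified page model (NOT the parquet-cpp types).
enum class PageKind { kDictionary, kPlainData, kDictionaryIndices };

struct Page {
  PageKind kind;
  // Dictionary entries, plain values, or dictionary indices, depending on kind.
  std::vector<int32_t> payload;
};

// Decodes all pages of one column chunk. The encoding is chosen per page, so a
// dictionary-indices page may follow a plain page. Malformed input raises an
// exception instead of tripping a debug-only check.
std::vector<int32_t> DecodeColumnChunk(const std::vector<Page>& pages) {
  std::vector<int32_t> dictionary;
  bool have_dictionary = false;
  std::vector<int32_t> out;

  for (const Page& page : pages) {
    switch (page.kind) {
      case PageKind::kDictionary:
        dictionary = page.payload;
        have_dictionary = true;
        break;
      case PageKind::kPlainData:
        // Plain values are copied through unchanged.
        out.insert(out.end(), page.payload.begin(), page.payload.end());
        break;
      case PageKind::kDictionaryIndices:
        if (!have_dictionary) {
          throw std::runtime_error("dictionary indices without a dictionary page");
        }
        for (int32_t index : page.payload) {
          if (index < 0 ||
              static_cast<std::size_t>(index) >= dictionary.size()) {
            throw std::out_of_range("dictionary index out of range");
          }
          out.push_back(dictionary[static_cast<std::size_t>(index)]);
        }
        break;
    }
  }
  return out;
}
```

In this sketch the only hard requirement is that a dictionary page has been
seen before any page of indices; the relative order of plain and
dictionary-encoded data pages is not checked at all.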
This situation has been brought up on the mailing list before
(https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos) and it seems
like this is valid but nobody is doing it.
In the PR that added this check (https://github.com/apache/parquet-cpp/pull/73)
it was noted that the check is probably not needed.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)