[ https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138033#comment-15138033 ]
Wes McKinney commented on PARQUET-505:
--------------------------------------

[~mdeepak] I have looked at the Impala code that handles this; I presume there is corresponding code in parquet-mr that inspects more and more of the file until it finds a complete page header. This patch will require a unit test fixture for {{SerializedPageReader}}; let me know if I can help with that.

> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
>                 Key: PARQUET-505
>                 URL: https://issues.apache.org/jira/browse/PARQUET-505
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Deepak Majeti
>
> Currently, we only support data pages whose headers are 64K or less (see
> {{parquet/column/serialized-page.cc}}). Since page headers can essentially be
> arbitrarily large (in pathological cases) because of the page statistics, if
> deserializing the page header fails, we should attempt to read a
> progressively larger amount of file data in an effort to find the end of the
> page header.
> As part of this (and to make testing easier!), the maximum data page header
> size should be configurable. We can write test cases by defining appropriate
> Statistics structs to yield serialized page headers of whatever desired size.
> On malformed files, we may run past the end of the file; in such cases we
> should raise a reasonable exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)