[
https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deepak Majeti reassigned PARQUET-505:
-------------------------------------
Assignee: Deepak Majeti
> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
> Key: PARQUET-505
> URL: https://issues.apache.org/jira/browse/PARQUET-505
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Deepak Majeti
>
> Currently, we only support data pages whose headers are 64K or less
> (see {{parquet/column/serialized-page.cc}}). Since page headers can be
> essentially arbitrarily large (in pathological cases) because of the page
> statistics, if deserializing the page header fails, we should attempt to
> read a progressively larger amount of file data in an effort to find the
> end of the page header.
> As part of this (and to make testing easier!), the maximum data page header
> size should be configurable. We can write test cases by defining appropriate
> Statistics structs to yield serialized page headers of whatever desired size.
> On malformed files, we may run past the end of the file; in such cases we
> should raise a reasonable exception.
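
The retry strategy described above could be sketched as follows. This is a
hypothetical illustration, not the parquet-cpp implementation: the names
{{TryDeserializeHeader}}, {{ReadPageHeader}}, and the size parameters are
invented for the example, and a real version would call the Thrift
deserializer against the bytes actually read from the file.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for Thrift page-header deserialization: succeeds only when the
// buffer holds the complete header (actual_header_size bytes). In the real
// reader this would be a Thrift protocol read that fails on truncated input.
static bool TryDeserializeHeader(const std::vector<uint8_t>& buf,
                                 size_t actual_header_size) {
  return buf.size() >= actual_header_size;
}

// Sketch of the proposed loop: start with a small read and double it on
// each deserialization failure, up to a configurable maximum header size
// (addressing the testability point above). `file_bytes_left` models how
// much data remains in the file; exhausting it on a malformed file raises
// an exception, as the issue requests. Returns the parsed header size.
size_t ReadPageHeader(size_t actual_header_size, size_t file_bytes_left,
                      size_t initial_read = 64 * 1024,
                      size_t max_header_size = 16 * 1024 * 1024) {
  size_t allowed = initial_read;
  while (true) {
    // Pretend we read min(allowed, file_bytes_left) bytes from the file.
    size_t to_read = std::min(allowed, file_bytes_left);
    std::vector<uint8_t> buf(to_read);
    if (TryDeserializeHeader(buf, actual_header_size)) {
      return actual_header_size;  // header fully parsed
    }
    if (to_read == file_bytes_left) {
      throw std::runtime_error("Malformed file: EOF while reading page header");
    }
    if (allowed >= max_header_size) {
      throw std::runtime_error("Page header exceeds configured maximum size");
    }
    allowed = std::min(allowed * 2, max_header_size);  // grow and retry
  }
}
```

With these assumptions, a 100 KB header succeeds after one doubling of the
initial 64K read, while a header extending past end-of-file throws instead
of looping forever.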
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)