Wes McKinney created PARQUET-505:
------------------------------------

             Summary: Column reader: automatically handle large data pages
                 Key: PARQUET-505
                 URL: https://issues.apache.org/jira/browse/PARQUET-505
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
            Reporter: Wes McKinney


Currently, we only support data pages whose headers are 64K or less (see 
{{parquet/column/serialized-page.cc}}). Since page headers can be essentially 
arbitrarily large (in pathological cases) because of the page statistics, when 
deserializing the page header fails we should retry, reading a progressively 
larger amount of file data in an effort to find the end of the page header. 
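
A minimal sketch of that retry loop, assuming a Peek/Advance-style input stream and a Thrift deserialization helper; every name, constant, and signature below is an illustrative assumption rather than the existing parquet-cpp API.

{code:cpp}
#include <cstdint>
#include <stdexcept>

// Stand-ins for the reader's existing stream and Thrift helpers; the exact
// names and signatures are assumptions for this sketch.
class InputStream {
 public:
  virtual ~InputStream() = default;
  virtual const uint8_t* Peek(int64_t num_to_peek, int64_t* num_bytes) = 0;
  virtual void Advance(int64_t num_bytes) = 0;
};

struct PageHeader;  // placeholder for the Thrift-generated page header struct

// Assumed helper: deserializes a Thrift message from buffer and sets *len to
// the number of bytes actually consumed; throws on failure.
void DeserializeThriftMsg(const uint8_t* buffer, uint32_t* len, PageHeader* header);

class ParquetException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

constexpr uint32_t kDefaultPageHeaderSize = 64 * 1024;           // today's hard cap
constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; // configurable ceiling

void ReadPageHeader(InputStream* stream, PageHeader* header,
                    uint32_t max_header_size = kDefaultMaxPageHeaderSize) {
  uint32_t allowed_size = kDefaultPageHeaderSize;
  while (true) {
    int64_t bytes_available = 0;
    const uint8_t* buffer = stream->Peek(allowed_size, &bytes_available);
    if (bytes_available == 0) {
      throw ParquetException("Unexpected end of stream while reading page header");
    }
    uint32_t header_size = static_cast<uint32_t>(bytes_available);
    try {
      DeserializeThriftMsg(buffer, &header_size, header);
      stream->Advance(header_size);  // consume only the bytes the header used
      return;
    } catch (const std::exception&) {
      // The header may simply be larger than the window we peeked at; retry
      // with a bigger window unless we already hit the cap or the file ended.
      if (allowed_size >= max_header_size ||
          bytes_available < static_cast<int64_t>(allowed_size)) {
        throw ParquetException("Deserializing page header failed or file truncated");
      }
      allowed_size *= 2;
    }
  }
}
{code}

Doubling the window keeps the number of retries logarithmic in the configured maximum, and bailing out when the stream returns fewer bytes than requested covers the truncated-file case described below.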

As part of this (and to make testing easier!), the maximum data page header 
size should be configurable. We can write test cases by defining appropriate 
Statistics structs that yield serialized page headers of any desired size.
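
For such a test, the Statistics min/max could be padded with blobs of a chosen length before serializing through the usual Thrift path. The sketch below assumes the Thrift-generated {{parquet::format}} structs and their {{__set_*}} setters; the include path and field choices are hypothetical.

{code:cpp}
#include <string>

#include "parquet/thrift/parquet_types.h"  // Thrift-generated structs (assumed path)

// Build a DATA_PAGE header whose Statistics are padded so that the serialized
// header is at least ~2 * stats_size bytes, letting tests exercise any
// configured maximum header size.
parquet::format::PageHeader MakeLargePageHeader(int32_t stats_size) {
  parquet::format::Statistics stats;
  stats.__set_max(std::string(stats_size, 'x'));  // pad to inflate serialized size
  stats.__set_min(std::string(stats_size, 'y'));

  parquet::format::DataPageHeader data_header;
  data_header.__set_num_values(0);
  data_header.__set_encoding(parquet::format::Encoding::PLAIN);
  data_header.__set_definition_level_encoding(parquet::format::Encoding::RLE);
  data_header.__set_repetition_level_encoding(parquet::format::Encoding::RLE);
  data_header.__set_statistics(stats);

  parquet::format::PageHeader header;
  header.__set_type(parquet::format::PageType::DATA_PAGE);
  header.__set_uncompressed_page_size(0);
  header.__set_compressed_page_size(0);
  header.__set_data_page_header(data_header);
  return header;
}
{code}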

On malformed files, we may run past the end of the file; in such cases we 
should raise a reasonable exception. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
