[ https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138033#comment-15138033 ]

Wes McKinney commented on PARQUET-505:
--------------------------------------

[~mdeepak] I have looked at the Impala code that handles this; I presume there 
is corresponding code in parquet-mr that inspects progressively more of the 
file until it finds a complete page header.

This patch will require a unit test fixture for {{SerializedPageReader}}; let 
me know if I can help with that. 
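To make the approach concrete, here is roughly the retry loop I have in mind 
for the body of something like {{SerializedPageReader::NextPage}}. This is a 
minimal sketch only: {{DeserializeThriftMsg}}, the {{Peek}}/{{Advance}} stream 
calls, and the member names are placeholders for illustration, not the 
existing parquet-cpp API:

{code:cpp}
// Start from the current default window and grow on failure.
uint32_t allowed_page_size = kDefaultPageHeaderSize;  // e.g. 64K today
uint32_t header_size = 0;
while (true) {
  int64_t bytes_available = 0;
  const uint8_t* buffer = stream_->Peek(allowed_page_size, &bytes_available);
  if (bytes_available == 0) {
    return nullptr;  // end of stream, no more pages
  }
  // Try to deserialize a Thrift PageHeader from the bytes we have so far.
  header_size = static_cast<uint32_t>(bytes_available);
  try {
    DeserializeThriftMsg(buffer, &header_size, &current_page_header_);
    break;  // success: header_size now holds the actual header length
  } catch (std::exception& e) {
    // The header may simply be larger than the window we peeked at; double
    // the window and retry, up to a configurable cap so that malformed files
    // fail with a reasonable exception instead of reading forever.
    allowed_page_size *= 2;
    if (allowed_page_size > max_page_header_size_) {
      throw ParquetException("Deserializing page header failed.");
    }
  }
}
stream_->Advance(header_size);
{code}

Doubling the window keeps the number of deserialization attempts logarithmic 
in the header size, and the cap gives us the configurable maximum described in 
the issue.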

> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
>                 Key: PARQUET-505
>                 URL: https://issues.apache.org/jira/browse/PARQUET-505
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Deepak Majeti
>
> Currently, we are only supporting data pages whose headers are 64K or less 
> (see {{parquet/column/serialized-page.cc}}). Since page headers can 
> essentially be arbitrarily large (in pathological cases) because of the page 
> statistics, if deserializing the page header fails, we should attempt to read 
> a progressively larger amount of file data in an effort to find the end of 
> the page header. 
> As part of this (and to make testing easier!), the maximum data page header 
> size should be configurable. We can write test cases by defining appropriate 
> Statistics structs to yield serialized page headers of any desired size (see 
> the sketch below).
> On malformed files, we may run past the end of the file; in such cases we 
> should raise a reasonable exception. 
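
For the testing note above, a helper along the following lines could produce a 
serialized page header of any target size by padding the statistics. This is a 
sketch under stated assumptions: the helper name is made up, the 
generated-header include path is assumed, and the smart-pointer type expected 
by {{TCompactProtocol}} depends on the Thrift version.

{code:cpp}
#include <memory>
#include <string>

#include <thrift/protocol/TCompactProtocol.h>
#include <thrift/transport/TBufferTransports.h>

#include "parquet/thrift/parquet_types.h"  // Thrift-generated types; path assumed

// Hypothetical test helper: serialize a DATA_PAGE PageHeader whose statistics
// are padded so that the encoded header is at least `target_size` bytes.
std::string MakeLargePageHeader(int target_size) {
  parquet::format::Statistics stats;
  stats.__set_max(std::string(target_size, 'x'));  // inflate the header

  parquet::format::DataPageHeader data_header;
  data_header.__set_num_values(0);
  data_header.__set_encoding(parquet::format::Encoding::PLAIN);
  data_header.__set_definition_level_encoding(parquet::format::Encoding::RLE);
  data_header.__set_repetition_level_encoding(parquet::format::Encoding::RLE);
  data_header.__set_statistics(stats);

  parquet::format::PageHeader header;
  header.__set_type(parquet::format::PageType::DATA_PAGE);
  header.__set_uncompressed_page_size(0);
  header.__set_compressed_page_size(0);
  header.__set_data_page_header(data_header);

  // std::shared_ptr on recent Thrift releases; older ones used boost::shared_ptr.
  auto buffer = std::make_shared<apache::thrift::transport::TMemoryBuffer>();
  apache::thrift::protocol::TCompactProtocol protocol(buffer);
  header.write(&protocol);
  return buffer->getBufferAsString();
}
{code}

Feeding the resulting bytes to {{SerializedPageReader}} under different 
maximum-header-size settings would then exercise both the growth path and the 
malformed-file failure path.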


