[ 
https://issues.apache.org/jira/browse/PARQUET-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165422#comment-14165422
 ] 

Tongjie Chen commented on PARQUET-100:
--------------------------------------

This patch introduces a new split strategy, 
"TaskSideFooterSamplingSplitStrategy", which reads only one footer per folder. 
It is a subclass of TaskSideMetadataSplitStrategy, and the two calculate 
splits in essentially the same way. 

For the HCatalog use case, the file schema and metadata should be the same 
across all file footers within each partition (folder), so reading one footer 
per folder is enough. The sampling strategy is turned off by default, so the 
change is backward compatible. 
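
The sampling idea can be illustrated with a short Python sketch. This is not 
the parquet-mr implementation (which is Java); it only shows the "one footer 
per folder" logic. The names sample_footers and read_footer are hypothetical: 
read_footer stands in for whatever actually parses a Parquet footer. 

```python
from collections import defaultdict
from pathlib import PurePosixPath


def sample_footers(file_paths, read_footer):
    """Read one footer per folder and reuse it for every file in that folder.

    `read_footer` is a hypothetical callable (path -> footer metadata);
    it is an assumption of this sketch, not a parquet-mr API.
    """
    # Group the input files by their parent folder (the partition).
    by_folder = defaultdict(list)
    for path in file_paths:
        by_folder[str(PurePosixPath(path).parent)].append(path)

    footers = {}
    for folder, paths in by_folder.items():
        # Assumption: all files in a partition folder share the same schema
        # and metadata, so a single footer read per folder is sufficient.
        sampled = read_footer(paths[0])
        for path in paths:
            footers[path] = sampled
    return footers
```

With N files spread over K folders, this performs K footer reads instead of N, 
which is where the client-side saving would come from when the 
same-schema-per-partition assumption holds. 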

This speeds up Pig client processing quite a bit.

To use this new strategy:

set parquet.task.side.metadata true;
set parquet.task.side.metadata.samplefooters true;

> provide an option in parquet-pig to avoid reading footers in client side
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-100
>                 URL: https://issues.apache.org/jira/browse/PARQUET-100
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: parquet-mr_1.6.0
>            Reporter: Tongjie Chen
>
> Parquet Pig reads footers on the client side to calculate splits, retrieve 
> the schema, etc.
> In an HCatalog environment, if there are a large number of files generated 
> by Hive, Parquet-Pig will spend a significant chunk of time processing those 
> footers on the client side (before the job is submitted to the cluster).  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
