[ 
https://issues.apache.org/jira/browse/PARQUET-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated PARQUET-128:
-------------------------------
    Description: 
The current RecordReader implementation reads all the columns before 
applying the filter predicate and deciding whether to keep or discard the 
row.
We can have a RecordReader that assembles only the columns on which filters 
are applied (usually a few), applies the filter to decide whether to keep 
the row, and then either assembles or skips the remaining columns 
accordingly.

Also, for applications like Spark SQL, the schema is usually flat, with no 
repeated or nested columns. In such cases, it is better to have a 
lightweight, faster RecordReader.

The performance improvement from this change is significant, and is 
greatest when filtering returns a small number of rows (which is usually 
the case) and there are many columns.

  was:
The current RecordReader implementation reads all the columns before 
applying the filter predicate and deciding whether to keep or discard the 
row.
We can have a RecordReader that assembles only the columns on which filters 
are applied (usually a few), applies the filter to decide whether to keep 
the row, and then either assembles or skips the remaining columns 
accordingly.

The performance improvement from this change is significant, and is 
greatest when filtering returns a small number of rows (which is usually 
the case) and there are many columns.


> Optimize the parquet RecordReader implementation when:  A. filter predicate is 
> pushed down, B. filter predicate is pushed down on a flat schema 
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-128
>                 URL: https://issues.apache.org/jira/browse/PARQUET-128
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0rc2
>            Reporter: Yash Datta
>             Fix For: parquet-mr_1.6.0
>
>
> The current RecordReader implementation reads all the columns before 
> applying the filter predicate and deciding whether to keep or discard the 
> row.
> We can have a RecordReader that assembles only the columns on which 
> filters are applied (usually a few), applies the filter to decide whether 
> to keep the row, and then either assembles or skips the remaining columns 
> accordingly.
> Also, for applications like Spark SQL, the schema is usually flat, with 
> no repeated or nested columns. In such cases, it is better to have a 
> lightweight, faster RecordReader.
> The performance improvement from this change is significant, and is 
> greatest when filtering returns a small number of rows (which is usually 
> the case) and there are many columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
