[ 
https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352715#comment-14352715
 ] 

Viktor Szathmáry commented on PARQUET-98:
-----------------------------------------


While I'm not familiar with the internals of the implementation, the main
difference seems to be that with filter2, all columns are deserialized for all
rows, whereas with the classic approach only the selector column is involved.

> filter2 API performance regression
> ----------------------------------
>
>                 Key: PARQUET-98
>                 URL: https://issues.apache.org/jira/browse/PARQUET-98
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Viktor Szathmáry
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :)
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>     input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter is twice as slow on the same Parquet file (built using
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build();
> {code}
> The approach based on the new filter API also seems to create a whole lot
> more garbage (perhaps due to reconstructing all the rows?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
