[
https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353615#comment-14353615
]
Alex Levenson commented on PARQUET-98:
--------------------------------------
[~phraktle] well, that's my point: even though you don't need to read A and B,
you've still got to skip over their current values to move on to the next record.
That often amounts to essentially reading the value anyway (think delta encoding,
for example). But as [~rdblue] said, we might be doing more work than necessary
to perform the skip. Definitely worth investigating, and definitely room for improvement.
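To make the skip cost concrete, here is a minimal sketch with a toy delta decoder (illustration only, not Parquet's actual delta-binary-packed reader): because each stored value is a delta against the previous one, "skipping" n values still means applying n deltas.
{code:java}
// Toy delta decoder, for illustration only -- not Parquet's actual
// delta-binary-packed implementation.
class DeltaIntDecoder {
    private final int[] deltas;  // encoded stream: each entry is a delta from the previous value
    private int position = 0;
    private long current = 0;

    DeltaIntDecoder(int[] deltas) {
        this.deltas = deltas;
    }

    long readNext() {
        current += deltas[position++];
        return current;
    }

    // Skipping cannot simply advance `position`: every delta still has
    // to be accumulated, or `current` would be wrong for the next value
    // the filter actually keeps. The skip costs about as much as a read.
    void skip(int n) {
        for (int i = 0; i < n; i++) {
            current += deltas[position++];
        }
    }
}
{code}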
> filter2 API performance regression
> ----------------------------------
>
> Key: PARQUET-98
> URL: https://issues.apache.org/jira/browse/PARQUET-98
> Project: Parquet
> Issue Type: Bug
> Reporter: Viktor Szathmáry
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :))
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>         input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter is about twice as slow on the same Parquet file (written with
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build()
> {code}
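> (For reference, a fully spelled-out version of the FilterPredicate case might
> look roughly like the following sketch. It assumes the parquet 1.6.x package
> layout, a Hadoop Path named path already in scope, and ProtoReadSupport from
> parquet-protobuf; the predicate is wrapped in FilterCompat, which is what the
> builder's withFilter expects.)
> {code:java}
> import com.google.protobuf.Message;
> import parquet.filter2.compat.FilterCompat;
> import parquet.filter2.predicate.FilterPredicate;
> import parquet.hadoop.ParquetReader;
> import parquet.io.api.Binary;
> import parquet.proto.ProtoReadSupport;
> import static parquet.filter2.predicate.FilterApi.binaryColumn;
> import static parquet.filter2.predicate.FilterApi.eq;
>
> // Build the predicate once, wrap it in the FilterCompat layer, and
> // hand it to the reader builder.
> FilterPredicate predicate = eq(binaryColumn(column), Binary.fromString(value));
> ParquetReader<Message> reader =
>     ParquetReader.builder(new ProtoReadSupport<Message>(), path)
>         .withFilter(FilterCompat.get(predicate))
>         .build();
> {code}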
> The approach based on the new filter API also seems to create a whole lot more
> garbage (perhaps because it reconstructs all the rows?).