Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Yea, that is all true. Actually, it would be okay just not to pass the filter
[here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L383)
for the normal Parquet reader, because we already set the filter for row groups
[here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L365)
for both the normal and the vectorized readers.
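For reference, a rough sketch of the two call sites linked above, paraphrased from `ParquetFileFormat.buildReader` around that commit as it would appear inside the file itself (this is my approximation, not the code verbatim; `hadoopConf` and `pushedPredicate: Option[FilterPredicate]` are placeholder names, and `ParquetReadSupport` is in the same package there):

```scala
import org.apache.parquet.filter2.compat.FilterCompat
import org.apache.parquet.hadoop.{ParquetInputFormat, ParquetRecordReader}
import org.apache.spark.sql.catalyst.InternalRow

// Row-group level push-down (the second link above): the converted predicate is
// set on the Hadoop configuration, so both the vectorized and the normal reader
// get row-group skipping.
pushedPredicate.foreach { predicate =>
  ParquetInputFormat.setFilterPredicate(hadoopConf, predicate)
}

// Record-level filtering (the first link above): only the normal (non-vectorized)
// path also hands the predicate to the reader itself, which turns on Parquet's
// row-by-row filtering (filter2) on top of the row-group skipping above.
val reader = pushedPredicate match {
  case Some(predicate) =>
    new ParquetRecordReader[InternalRow](
      new ParquetReadSupport, FilterCompat.get(predicate, null))
  case _ =>
    new ParquetRecordReader[InternalRow](new ParquetReadSupport)
}
```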
(BTW, you might have already noticed that the actual per-record filtering (filter2)
happens in `FilteringRecordMaterializer`, which uses Spark's
`ParquetRecordMaterializer` as a delegate.)
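Just to illustrate that delegation: below is a simplified, hypothetical model, not parquet-mr's actual `FilteringRecordMaterializer` API (the real class takes more constructor arguments). The idea is only that the filtering materializer evaluates the pushed predicate per record and drops records that fail, delegating the actual assembly to Spark's materializer.

```scala
// Hypothetical shapes for illustration only; parquet-mr's real classes differ.
trait RecordMaterializerLike[T] {
  def currentRecord: T  // the record assembled from the current Parquet row
}

class FilteringMaterializerSketch[T >: Null](
    delegate: RecordMaterializerLike[T],   // e.g. Spark's ParquetRecordMaterializer
    recordPasses: () => Boolean)           // evaluates the filter2 predicate for the row
  extends RecordMaterializerLike[T] {

  // Records that fail the predicate are dropped (here: surfaced as null so the
  // caller can skip them); everything else is assembled by the delegate.
  override def currentRecord: T =
    if (recordPasses()) delegate.currentRecord else null
}
```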
However, if my understanding is correct, the point is that it _seems_ not even
clear whether we should disable Parquet's row-by-row filtering or not. In my
point of view, we need a benchmark to see whether Spark's codegen filter is
really faster than Parquet's row-by-row one.
I _guess_ we are assuming Spark's is faster than Parquet's.
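If it helps, a minimal sketch of what such a benchmark could look like; note that `spark.sql.parquet.filterPushdown` toggles both row-group and record-level push-down, so fully isolating the row-by-row part would still need a code change like the one discussed here. The data path and filter column are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object ParquetFilterBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-row-filter-benchmark")
      .master("local[*]")
      .getOrCreate()

    // To exercise the normal (non-vectorized) path, the vectorized reader can be
    // turned off as well.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    def timeScan(pushdownEnabled: Boolean): Long = {
      spark.conf.set("spark.sql.parquet.filterPushdown", pushdownEnabled.toString)
      val start = System.nanoTime()
      // Placeholder path and predicate; a selective filter makes the difference visible.
      spark.read.parquet("/tmp/benchmark-data").filter("id % 1000 = 0").count()
      (System.nanoTime() - start) / 1000000
    }

    // Warm up once, then compare. With push-down off, filtering is done only by
    // Spark's codegen'd filter; with it on, Parquet also filters (row groups and,
    // for the normal reader, individual records).
    timeScan(pushdownEnabled = true)
    println(s"push-down enabled:  ${timeScan(pushdownEnabled = true)} ms")
    println(s"push-down disabled: ${timeScan(pushdownEnabled = false)} ms")

    spark.stop()
  }
}
```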