[
https://issues.apache.org/jira/browse/SPARK-22536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22536.
----------------------------------
Resolution: Incomplete
> VectorizedParquetRecordReader doesn't use Parquet's dictionary filtering feature
> --------------------------------------------------------------------------------
>
> Key: SPARK-22536
> URL: https://issues.apache.org/jira/browse/SPARK-22536
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0
> Reporter: Ivan Gozali
> Priority: Major
> Labels: bulk-closed, filter2, parquet, predicate, pushdown
>
> The VectorizedParquetRecordReader currently uses only statistics filtering
> and does not make use of Parquet's dictionary filtering. Dictionary
> filtering would be very useful for string/binary columns that have low
> cardinality.
> Some relevant code paths:
> *
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L367-L387
> When vectorizedReader is enabled, the code uses
> VectorizedParquetRecordReader, which extends SpecificParquetRecordReaderBase
> (linked below).
> *
> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L109
> This is where the row group filtering is performed. It calls the method
> below.
> *
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L64-L70
> The RowGroupFilter constructor reached from Spark's
> {{VectorizedParquetRecordReader}} is deprecated and hard-codes the filter
> levels to {{FilterLevel.STATISTICS}} only.
> {code}
> @Deprecated
> private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
>   this.blocks = checkNotNull(blocks, "blocks");
>   this.schema = checkNotNull(schema, "schema");
>   this.levels = Collections.singletonList(FilterLevel.STATISTICS);
>   this.reader = null;
> }
> {code}
> Compare this to
> {{org.apache.parquet.hadoop.ParquetRecordReader.initialize()}}, which uses
> the second RowGroupFilter constructor that allows it to set the
> {{FilterLevel}}. Relevant code here:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182
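
For readers skimming the archive, the difference between the two filter levels can be shown with a toy Java sketch. It is only an illustration: ToyRowGroup and its methods are hypothetical names, not parquet-mr or Spark classes, and the sketch assumes every value of the column in the row group is dictionary-encoded (as far as I know, the real dictionary filter cannot drop a row group otherwise).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a hypothetical stand-in for one string column of one row group.
final class ToyRowGroup {
    final String min;              // column statistics: smallest value
    final String max;              // column statistics: largest value
    final Set<String> dictionary;  // distinct values of the (fully dictionary-encoded) column

    ToyRowGroup(List<String> values) {
        TreeSet<String> sorted = new TreeSet<>(values);
        this.min = sorted.first();
        this.max = sorted.last();
        this.dictionary = sorted;
    }

    // Statistics level: the row group can be skipped only if the wanted
    // value lies outside the [min, max] range.
    boolean statisticsCanDrop(String wanted) {
        return wanted.compareTo(min) < 0 || wanted.compareTo(max) > 0;
    }

    // Dictionary level: the row group can be skipped whenever the wanted
    // value is not one of the distinct values actually stored.
    boolean dictionaryCanDrop(String wanted) {
        return !dictionary.contains(wanted);
    }

    public static void main(String[] args) {
        // Low-cardinality column: only two distinct values, far apart.
        ToyRowGroup group = new ToyRowGroup(Arrays.asList("apple", "zebra", "apple", "zebra"));
        String wanted = "mango";  // predicate: column = 'mango'
        System.out.println("statistics can drop: " + group.statisticsCanDrop(wanted)); // false
        System.out.println("dictionary can drop: " + group.dictionaryCanDrop(wanted)); // true
    }
}
```

With an equality predicate on 'mango', the min/max range [apple, zebra] still contains the value, so the statistics level has to read the row group, while the dictionary level can skip it. That is the case the issue describes for low-cardinality string/binary columns.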