Ivan Gozali created SPARK-22536:
-----------------------------------

             Summary: VectorizedParquetRecordReader doesn't use Parquet's 
dictionary filtering feature
                 Key: SPARK-22536
                 URL: https://issues.apache.org/jira/browse/SPARK-22536
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Spark 2.2.0
            Reporter: Ivan Gozali


The VectorizedParquetRecordReader currently uses only statistics filtering and 
does not make use of Parquet's dictionary filtering. Dictionary filtering 
would be very useful for string/binary columns that have low cardinality, 
where min/max statistics alone often cannot rule out a row group.
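
To illustrate why this matters, here is a minimal toy sketch. The class and 
method names ({{DictionaryFilterSketch}}, {{ColumnChunk}}, 
{{canDropByStatistics}}, {{canDropByDictionary}}) are hypothetical and are not 
part of parquet-mr or Spark. They only model the idea that a row group's 
min/max range can contain a value that its dictionary proves is absent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: these types are hypothetical, NOT the parquet-mr API.
class DictionaryFilterSketch {

    /** Hypothetical stand-in for one column chunk's metadata in a row group. */
    static final class ColumnChunk {
        final String min;             // minimum value from column statistics
        final String max;             // maximum value from column statistics
        final Set<String> dictionary; // distinct values, or null if the chunk
                                      // is not fully dictionary-encoded

        ColumnChunk(String min, String max, Set<String> dictionary) {
            this.min = min;
            this.max = max;
            this.dictionary = dictionary;
        }
    }

    /** Statistics check: drop only if the value lies outside [min, max]. */
    static boolean canDropByStatistics(ColumnChunk c, String value) {
        return value.compareTo(c.min) < 0 || value.compareTo(c.max) > 0;
    }

    /** Dictionary check: drop if a complete dictionary lacks the value. */
    static boolean canDropByDictionary(ColumnChunk c, String value) {
        return c.dictionary != null && !c.dictionary.contains(value);
    }

    public static void main(String[] args) {
        // A low-cardinality string column with three distinct values.
        ColumnChunk country = new ColumnChunk(
            "AU", "US", new HashSet<>(Arrays.asList("AU", "ID", "US")));

        // "JP" sorts between "AU" and "US", so statistics cannot exclude it.
        System.out.println(canDropByStatistics(country, "JP"));

        // The dictionary lists every distinct value and "JP" is not in it.
        System.out.println(canDropByDictionary(country, "JP"));
    }
}
```

In this sketch a filter such as {{country = 'JP'}} would still read the row 
group under statistics filtering alone, while a dictionary check could skip it.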

Some relevant code paths:
* https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L367-L387
 When vectorizedReader is enabled, the code will use 
VectorizedParquetRecordReader, which uses SpecificParquetRecordReaderBase below
* https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L109
 This is where the row group filtering is being performed. It calls the method 
below
* https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L64-L70

The RowGroupFilter constructor used in Spark's 
{{VectorizedParquetRecordReader}} is deprecated, and it hard-codes the 
{{FilterLevel}} list to only {{FilterLevel.STATISTICS}}:
{code}
  @Deprecated
  private RowGroupFilter(List<BlockMetaData> blocks, MessageType schema) {
    this.blocks = checkNotNull(blocks, "blocks");
    this.schema = checkNotNull(schema, "schema");
    this.levels = Collections.singletonList(FilterLevel.STATISTICS);
    this.reader = null;
  }
{code}

Compare this to {{org.apache.parquet.hadoop.ParquetRecordReader.initialize()}}, 
which uses the second RowGroupFilter constructor and can therefore set the 
{{FilterLevel}} values itself. Relevant code here:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L166-L182





