GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/21295

    [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary filters.

    ## What changes were proposed in this pull request?
    
    I missed this commit when preparing #21070.
    
    When Parquet is able to filter blocks with dictionary filtering, the 
expected total value count was too high in Spark, leading to an error when 
there were fewer row groups to process than expected. Spark should get the row 
groups from Parquet so that it picks up new filter schemes in Parquet, such as 
dictionary filtering.
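
To illustrate the mismatch described above, here is a minimal sketch in plain Java. It does not use the Parquet or Spark APIs: `RowGroup`, `dictionaryMatches`, `applyFilter`, and `totalRows` are hypothetical names invented for this example. It assumes the expected count was summed over every row group listed in the file footer, while the reader only returns the row groups that survive filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class RowGroupCountSketch {
    /** Hypothetical stand-in for the metadata of one Parquet row group. */
    static final class RowGroup {
        final long rowCount;
        // Assumed outcome of a dictionary filter: false means the row group
        // cannot contain a matching value and is skipped by the reader.
        final boolean dictionaryMatches;

        RowGroup(long rowCount, boolean dictionaryMatches) {
            this.rowCount = rowCount;
            this.dictionaryMatches = dictionaryMatches;
        }
    }

    /** Sums the row counts of the given row groups. */
    static long totalRows(List<RowGroup> groups) {
        long total = 0;
        for (RowGroup g : groups) {
            total += g.rowCount;
        }
        return total;
    }

    /** Returns only the row groups that the filter keeps. */
    static List<RowGroup> applyFilter(List<RowGroup> groups, Predicate<RowGroup> keep) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup g : groups) {
            if (keep.test(g)) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Row groups as listed in the file footer (made-up numbers).
        List<RowGroup> footer = Arrays.asList(
            new RowGroup(100, true),
            new RowGroup(200, false),
            new RowGroup(50, true));

        // Row groups the reader will actually return after filtering.
        List<RowGroup> filtered = applyFilter(footer, g -> g.dictionaryMatches);

        // Counting from the footer overestimates what will be read.
        System.out.println("expected from footer:   " + totalRows(footer));
        // Counting from the filtered list matches what will be read.
        System.out.println("expected from filtered: " + totalRows(filtered));
    }
}
```

With these made-up numbers, the footer-based total is 350 while only 150 rows are ever read. Taking the count from the same filtered list that the reader iterates over keeps the two in agreement, whichever filter dropped the row groups.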
    
    ## How was this patch tested?
    
    By hand. Need to add a test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-24230-fix-parquet-block-tracking

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21295
    
----
commit 0fa5abeffc6184979d5909e29eb43d33991ed832
Author: Ryan Blue <blue@...>
Date:   2018-01-31T00:48:01Z

    SPARK-24230: Fix SpecificParquetRecordReaderBase with dictionary filters.
    
    Filtered blocks were causing the expected total value count to be too
    high, which led to an error when there were fewer than expected row
    groups to process.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
