Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16578#discussion_r148722822
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
    @@ -127,8 +127,8 @@ private[parquet] class ParquetRowConverter(
       extends ParquetGroupConverter(updater) with Logging {
     
       assert(
    -    parquetType.getFieldCount == catalystType.length,
    -    s"""Field counts of the Parquet schema and the Catalyst schema don't match:
    +    parquetType.getFieldCount <= catalystType.length,
    --- End diff --
    
    In `ParquetReadSupport.scala`, when `parquetMrCompatibility` is `true`, we intersect the clipped Parquet schema with the underlying Parquet file's schema. This can leave the requested Parquet schema with fewer fields than the requested Catalyst schema.
    
    For example, for a partitioned table where we select a column that doesn't exist in the schema of one partition's files, we remove the missing column from the requested Parquet schema for that partition.
    
    This scenario is illustrated and tested by the "partial schema intersection - select missing subfield" test in `ParquetSchemaPruningSuite.scala`.
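    To make the field-count relationship concrete, here is a minimal, self-contained sketch (not the actual Spark/Parquet API; the schema types and `intersect` helper are hypothetical stand-ins) of why intersecting the clipped schema with a file's schema yields `parquetType.getFieldCount <= catalystType.length` rather than strict equality:

    ```scala
    // Hypothetical sketch, not Spark internals: schemas are modeled as field-name lists.
    object SchemaIntersectionSketch {
      case class ParquetSchema(fieldNames: Seq[String])   // stand-in for the Parquet MessageType
      case class CatalystSchema(fieldNames: Seq[String])  // stand-in for the Catalyst StructType

      // Keep only the clipped fields that actually exist in the file's schema.
      def intersect(clipped: ParquetSchema, fileSchema: ParquetSchema): ParquetSchema =
        ParquetSchema(clipped.fieldNames.filter(fileSchema.fieldNames.contains))

      def main(args: Array[String]): Unit = {
        val catalyst = CatalystSchema(Seq("id", "name", "missingCol"))
        val clipped  = ParquetSchema(Seq("id", "name", "missingCol"))
        // One partition's files lack "missingCol".
        val file     = ParquetSchema(Seq("id", "name"))

        val requested = intersect(clipped, file)
        // The relaxed assertion from the diff: <=, not ==.
        assert(requested.fieldNames.length <= catalyst.fieldNames.length)
        println(requested.fieldNames.mkString(","))
      }
    }
    ```

    After the intersection the requested Parquet schema has two fields while the Catalyst schema still has three, which is exactly the case the relaxed `<=` assertion admits.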


---
