[GitHub] spark pull request: [SPARK-5775] BugFix: GenericRow cannot be cast...

anselmevignon Thu, 19 Feb 2015 10:47:52 -0800

GitHub user anselmevignon opened a pull request:

    https://github.com/apache/spark/pull/4697


    [SPARK-5775] BugFix: GenericRow cannot be cast to SpecificMutableRow when
    nested data and partitioned table

    The bug fixed here was caused by a change in PartitionTableScan and shows
    up when reading a partitioned table.
    
    - When the partition column is requested from a Parquet table, the table
      scan needs to add that column back to the output Rows.
    - To update the Row object created by PartitionTableScan, the Row was first
      cast to SpecificMutableRow and then updated.
    - This cast was unsafe, since there is no guarantee that the newHadoopRDD
      used internally will instantiate the output Rows as MutableRow.
    
    In particular, when reading a table with complex types (e.g. struct or
    array), the newHadoopRDD uses a parquet.io.api.RecordMaterializer produced
    by org.apache.spark.sql.parquet.RowReadSupport. When complex types are
    involved, the
    org.apache.spark.sql.parquet.CatalystConverter.createRootConverter factory
    creates this consumer as an
    org.apache.spark.sql.parquet.CatalystGroupConverter (a) and not an
    org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter (b).
    
    Consumer (a) outputs GenericRow, while consumer (b) produces
    SpecificMutableRow.
    
    Therefore, any query that selects a partition column together with a
    complex-type column gets its rows back as GenericRow, and the unsafe cast
    fails (see https://issues.apache.org/jira/browse/SPARK-5775 for an example).
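
    The failure described above can be reproduced in miniature. This is only
    a sketch: the Row classes below are simplified stand-ins that borrow the
    Catalyst names, and `materialize` and `unsafeAddPartition` are
    hypothetical helpers, not Spark code.

```scala
// Simplified stand-ins for the Catalyst row classes; NOT the real Spark types.
trait Row { def apply(i: Int): Any; def length: Int }

class GenericRow(values: Array[Any]) extends Row {
  def apply(i: Int): Any = values(i)
  def length: Int = values.length
}

trait MutableRow extends Row { def update(i: Int, value: Any): Unit }

class SpecificMutableRow(values: Array[Any]) extends MutableRow {
  def apply(i: Int): Any = values(i)
  def length: Int = values.length
  def update(i: Int, value: Any): Unit = { values(i) = value }
}

// Hypothetical stand-in for the choice made by the root converter factory:
// complex types yield GenericRow, primitive-only schemas SpecificMutableRow.
def materialize(values: Array[Any], hasComplexTypes: Boolean): Row =
  if (hasComplexTypes) new GenericRow(values) else new SpecificMutableRow(values)

// The unsafe pattern: cast unconditionally, then write the partition value.
def unsafeAddPartition(row: Row, ordinal: Int, partValue: Any): Row = {
  val mutable = row.asInstanceOf[SpecificMutableRow] // throws for GenericRow
  mutable.update(ordinal, partValue)
  mutable
}

val flat = materialize(Array[Any](1, null), hasComplexTypes = false)
val nested = materialize(Array[Any](Seq(1, 2), null), hasComplexTypes = true)

// Works for the primitive-only row, fails for the row with a complex column.
val okResult = unsafeAddPartition(flat, 1, "2015-02-19")
val failed = scala.util.Try(unsafeAddPartition(nested, 1, "2015-02-19"))
```

    In this sketch the primitive-only row is updated normally, while the row
    holding a complex column fails with a ClassCastException at the cast.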
    
    The fix proposed here replaces the unsafe class cast with a pattern match
    on the Row type: it updates the Row if it is of a mutable type, and
    creates a new Row if it is not.
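
    A minimal sketch of that approach, assuming simplified stand-in Row
    classes rather than the real Catalyst ones. `withPartitionValue` is a
    hypothetical helper name, and the code in the patch itself may differ.

```scala
// Simplified stand-ins for the Catalyst row classes; NOT the real Spark types.
trait Row { def toSeq: Seq[Any] }
trait MutableRow extends Row { def update(i: Int, value: Any): Unit }

class GenericRow(values: Array[Any]) extends Row {
  def toSeq: Seq[Any] = values.toSeq
}

class SpecificMutableRow(values: Array[Any]) extends MutableRow {
  def toSeq: Seq[Any] = values.toSeq
  def update(i: Int, value: Any): Unit = { values(i) = value }
}

// Hypothetical helper: write a partition value at `ordinal` without assuming
// the concrete row class.
def withPartitionValue(row: Row, ordinal: Int, partValue: Any): Row = row match {
  case mutable: MutableRow =>
    mutable.update(ordinal, partValue) // mutable row: update it in place
    mutable
  case immutable =>
    // Not mutable: build a new row with the slot at `ordinal` replaced.
    new GenericRow(immutable.toSeq.updated(ordinal, partValue).toArray)
}

val flat = new SpecificMutableRow(Array[Any](1, null))
val nested = new GenericRow(Array[Any](Seq(1, 2), null))

val flatOut = withPartitionValue(flat, 1, "2015-02-19")
val nestedOut = withPartitionValue(nested, 1, "2015-02-19")
```

    The mutable branch keeps the cheap in-place update for primitive-only
    rows; only rows that are not mutable pay for a copy.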
    
    This fix is unit-tested in
    sql/hive/src/test/scala/org/apache/spark/sql/parquet/parquetSuites.scala

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/flaminem/spark local_dev

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4697.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4697
    
----
commit 4eb04e971244d2c49085e17ae4685a31e6808066
Author: Anselme Vignon <[email protected]>
Date:   2015-02-18T09:51:52Z

    bugfix SPARK-5775

commit dbceaa308921f298b3cd9cc98fae66e1271c7f1c
Author: Anselme Vignon <[email protected]>
Date:   2015-02-18T11:17:38Z

    cutting lines

commit f876dea96d50f9df0c4d9992e82b00d3a4a7968f
Author: Anselme Vignon <[email protected]>
Date:   2015-02-18T11:17:55Z

    starting to write tests

commit ae48f7c98410d320b128ed23fb5c6cdbcb8b504c
Author: Anselme Vignon <[email protected]>
Date:   2015-02-19T18:08:48Z

    unittesting SPARK-5775

commit 22cec5206091580e9922f997ef8052ded393d225
Author: Anselme Vignon <[email protected]>
Date:   2015-02-19T18:18:02Z

    lint compatible changes

----



