GitHub user anselmevignon opened a pull request:
https://github.com/apache/spark/pull/4697
[SPARK-5775] BugFix: GenericRow cannot be cast to SpecificMutableRow when
nested data and partitioned table
The bug fixed here was due to a change in PartitionTableScan and appears
when reading a partitioned table.
- When the partition column is requested from a Parquet table, the table
scan needs to add the column back to the output Rows.
- To update the Row object created by PartitionTableScan, the Row was first
cast to SpecificMutableRow and then updated.
- This cast was unsafe, since there is no guarantee that the
newHadoopRDD used internally will instantiate the output Rows as MutableRow.
In particular, when reading a table with complex types (e.g. struct or
array), the newHadoopRDD uses a parquet.io.api.RecordMaterializer that is
produced by org.apache.spark.sql.parquet.RowReadSupport. When complex types
are involved, the
org.apache.spark.sql.parquet.CatalystConverter.createRootConverter factory
creates this consumer as an
org.apache.spark.sql.parquet.CatalystGroupConverter (a) and not an
org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter (b).
Consumer (a) outputs GenericRow, while consumer (b) produces
SpecificMutableRow.
Therefore, any query selecting a partition column plus a complex-type
column gets its rows back as GenericRow, and the unsafe cast fails (see
https://issues.apache.org/jira/browse/SPARK-5775 for an example).
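The failure mode described above can be reproduced in isolation. The
following is only a minimal sketch: Row, GenericRow, SpecificMutableRow,
ColumnType, materialize and addPartitionValueUnsafe are simplified,
hypothetical stand-ins written for this illustration, not the actual Spark
or Parquet classes.

```scala
// Hypothetical stand-ins for illustration only; NOT the real Spark classes.
object UnsafeCastSketch {
  sealed trait Row { def values: Seq[Any] }

  // Immutable row, as produced by the converter used for complex types.
  final class GenericRow(val values: Seq[Any]) extends Row

  // Mutable row, as produced by the converter used for primitive-only schemas.
  final class SpecificMutableRow(init: Seq[Any]) extends Row {
    private val buf: Array[Any] = init.toArray
    def values: Seq[Any] = buf.toSeq
    def update(i: Int, v: Any): Unit = buf(i) = v
  }

  // Simplified schema model (assumption): a column is primitive or complex.
  sealed trait ColumnType
  case object Primitive extends ColumnType
  case object Complex extends ColumnType

  // Mimics the converter choice: any complex column yields immutable rows.
  def materialize(schema: Seq[ColumnType], values: Seq[Any]): Row =
    if (schema.contains(Complex)) new GenericRow(values)
    else new SpecificMutableRow(values)

  // The pre-fix pattern: cast blindly, then write the partition value in place.
  def addPartitionValueUnsafe(row: Row, ordinal: Int, value: Any): Row = {
    // Throws ClassCastException when `row` is a GenericRow.
    val mutable = row.asInstanceOf[SpecificMutableRow]
    mutable.update(ordinal, value)
    mutable
  }
}
```

With a primitive-only schema the cast succeeds; as soon as one column is
complex, materialize returns a GenericRow and the cast throws a
ClassCastException.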
The bugfix proposed here replaces the unsafe class cast with a pattern
match on the Row type: it updates the Row if it is of a mutable type, and
recreates the Row if it is not.
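As a rough sketch of that approach (not the code in this pull request), the
match could look like the following. The Row classes and the
addPartitionValue helper are hypothetical stand-ins, and the sketch assumes
the row already has a placeholder slot for the partition column.

```scala
// Hypothetical stand-ins for illustration only; NOT the real Spark classes.
object SafeUpdateSketch {
  sealed trait Row { def values: Seq[Any] }

  final class GenericRow(val values: Seq[Any]) extends Row

  final class SpecificMutableRow(init: Seq[Any]) extends Row {
    private val buf: Array[Any] = init.toArray
    def values: Seq[Any] = buf.toSeq
    def update(i: Int, v: Any): Unit = buf(i) = v
  }

  // Match on the runtime type instead of casting.
  def addPartitionValue(row: Row, ordinal: Int, value: Any): Row = row match {
    case mutable: SpecificMutableRow =>
      // Mutable row: update in place and reuse the same object.
      mutable.update(ordinal, value)
      mutable
    case other =>
      // Immutable row: build a new row carrying the partition value.
      new GenericRow(other.values.updated(ordinal, value))
  }
}
```

The mutable branch keeps the in-place update and allocates nothing new;
only rows that cannot be mutated pay for a copy.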
This fix is unit-tested in
sql/hive/src/test/scala/org/apache/spark/sql/parquet/parquetSuites.scala
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/flaminem/spark local_dev
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4697.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4697
----
commit 4eb04e971244d2c49085e17ae4685a31e6808066
Author: Anselme Vignon <[email protected]>
Date: 2015-02-18T09:51:52Z
bugfix SPARK-5775
commit dbceaa308921f298b3cd9cc98fae66e1271c7f1c
Author: Anselme Vignon <[email protected]>
Date: 2015-02-18T11:17:38Z
cutting lines
commit f876dea96d50f9df0c4d9992e82b00d3a4a7968f
Author: Anselme Vignon <[email protected]>
Date: 2015-02-18T11:17:55Z
starting to write tests
commit ae48f7c98410d320b128ed23fb5c6cdbcb8b504c
Author: Anselme Vignon <[email protected]>
Date: 2015-02-19T18:08:48Z
unittesting SPARK-5775
commit 22cec5206091580e9922f997ef8052ded393d225
Author: Anselme Vignon <[email protected]>
Date: 2015-02-19T18:18:02Z
lint compatible changes
----