GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/1639
[SPARK-2737] Add retag() method for changing RDDs' ClassTags.
The Java API's use of fake ClassTags doesn't seem to cause any problems for
Java users, but it can lead to issues when a JavaRDD's underlying RDD is passed
to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect() on
a Scala RDD with an incorrect ClassTag, the result array is allocated with the
wrong type, which causes a ClassCastException (for example, see SPARK-2197).
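The failure mode described above can be reproduced without Spark. The following is only a minimal sketch: `FakeClassTagDemo`, `collectLike`, and `collectStrings` are hypothetical names made up for illustration, and `fakeClassTag` is assumed to behave like the Java API's fake tag, i.e. a ClassTag for java.lang.Object cast to `ClassTag[T]`.

```scala
import scala.reflect.ClassTag

object FakeClassTagDemo {
  // Assumption: a "fake" tag is an Object ClassTag cast to ClassTag[T].
  def fakeClassTag[T]: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]

  // Hypothetical stand-in for collect(): the result array is allocated
  // from whatever ClassTag the caller supplies.
  def collectLike[T](items: Seq[T])(implicit ct: ClassTag[T]): Array[T] = {
    val out = ct.newArray(items.length)
    items.zipWithIndex.foreach { case (x, i) => out(i) = x }
    out
  }

  // Scala code that statically expects Array[String]. With a fake tag the
  // runtime array is an Object[], so the cast to String[] fails here.
  def collectStrings(tag: ClassTag[String]): Array[String] =
    collectLike(Seq("a", "b"))(tag)
}
```

Calling `FakeClassTagDemo.collectStrings(scala.reflect.classTag[String])` returns a real `Array[String]`, while passing `FakeClassTagDemo.fakeClassTag[String]` should fail with a ClassCastException, because the array that was allocated is an array of Object.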
There are a few possible fixes here. An API-breaking fix would be to
completely remove the fake ClassTags and require Java API users to pass
java.lang.Class instances to all parallelize() calls and add returnClass fields
to all Function implementations. This would be extremely verbose.
Instead, this patch adds internal APIs to "repair" a Scala RDD with an
incorrect ClassTag by wrapping it and overriding its ClassTag. This should be
okay for cases where the Scala code that calls collect() knows what type of
array should be allocated, which is the case in the MLlib wrappers.
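To illustrate the "wrap and override the ClassTag" idea, here is a rough sketch on a toy dataset class. It is not the code in this patch: `MiniRDD`, `RetagSketch`, `fromJava`, and the helper methods are hypothetical, and only the name retag() comes from the PR title. The point is that the caller, which knows the real element type, supplies the correct tag, and the wrapper uses it when allocating the array.

```scala
import scala.reflect.ClassTag

object RetagSketch {
  // Hypothetical, heavily simplified stand-in for an RDD.
  class MiniRDD[T](val items: Seq[T], val elementClassTag: ClassTag[T]) {
    // Allocates the result array from this dataset's ClassTag.
    def collect(): Array[T] = {
      val out = elementClassTag.newArray(items.length)
      items.zipWithIndex.foreach { case (x, i) => out(i) = x }
      out
    }

    // Wraps the same data in a new MiniRDD carrying the caller's ClassTag.
    def retag(implicit ct: ClassTag[T]): MiniRDD[T] = new MiniRDD[T](items, ct)
  }

  // Assumption: a "fake" tag is an Object ClassTag cast to ClassTag[T].
  def fakeClassTag[T]: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]

  // A dataset as it might arrive from Java code, carrying the fake tag.
  def fromJava: MiniRDD[String] =
    new MiniRDD[String](Seq("a", "b"), fakeClassTag[String])

  // Fails: the array is an Object[] but Array[String] is expected.
  def brokenCollect(): Array[String] = fromJava.collect()

  // Works: String is known here, so the implicit ClassTag[String] is real.
  def repairedCollect(): Array[String] = {
    val repaired = fromJava.retag
    repaired.collect()
  }
}
```

In this sketch the repair only helps because the call site of retag statically knows the element type, which matches the caveat above about the Scala code knowing what type of array should be allocated.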
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-2737
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1639.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1639
----
commit eb1c7feec65a04ffc9d343ea74c19137161ce44f
Author: Josh Rosen <[email protected]>
Date: 2014-07-29T21:16:47Z
[SPARK-2737] Add retag() method for changing RDDs' ClassTags.
----