[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

JoshRosen Tue, 29 Jul 2014 14:25:25 -0700

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/1639


    [SPARK-2737] Add retag() method for changing RDDs' ClassTags.

    The Java API's use of fake ClassTags doesn't seem to cause any problems for
    Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs
    to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect()
    on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions
    when we try to allocate an array of the wrong type (for example, see
    SPARK-2197).
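To make the failure mode concrete, here is a minimal Java sketch of the same mechanism outside Spark. It is an analogy only: it assumes `java.lang.Class` as a stand-in for a Scala ClassTag, and `FakeTagDemo`, `collect` and `failsWithFakeTag` are hypothetical names, not Spark APIs.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Spark.
class FakeTagDemo {
    // Allocates the result array from a runtime class token, much as a
    // ClassTag-driven collect() would. The token is trusted, not checked.
    @SuppressWarnings("unchecked")
    static <T> T[] collect(List<T> items, Class<?> tag) {
        Object[] out = (Object[]) Array.newInstance(tag, items.size());
        for (int i = 0; i < out.length; i++) {
            out[i] = items.get(i);
        }
        return (T[]) out; // unchecked: erased to Object[] at runtime
    }

    // Returns true if collecting with a "fake" tag fails at the call site.
    static boolean failsWithFakeTag() {
        try {
            // Fake tag: Object.class, although the elements are Strings.
            String[] a = collect(Arrays.asList("a", "b"), Object.class);
            return false;
        } catch (ClassCastException e) {
            return true; // an Object[] cannot be cast to String[]
        }
    }

    public static void main(String[] args) {
        System.out.println("fake tag fails: " + failsWithFakeTag());
        String[] ok = collect(Arrays.asList("a", "b"), String.class);
        System.out.println("correct tag length: " + ok.length);
    }
}
```

The exception surfaces in the caller, at the point where the wrongly typed array is used as a `String[]`, rather than where the fake tag was created.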
    
    There are a few possible fixes here. An API-breaking fix would be to
    completely remove the fake ClassTags and require Java API users to pass
    java.lang.Class instances to all parallelize() calls and add returnClass
    fields to all Function implementations. This would be extremely verbose.
    
    Instead, this patch adds internal APIs to "repair" a Scala RDD with an
    incorrect ClassTag by wrapping it and overriding its ClassTag. This should
    be okay for cases where the Scala code that calls collect() knows what type
    of array should be allocated, which is the case in the MLlib wrappers.
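The wrap-and-override idea can be sketched in plain Java as follows. This is not the patch's code: `Tagged` is a hypothetical stand-in for an RDD, `java.lang.Class` stands in for a ClassTag, and the `retag` signature shown here is an assumption made for illustration.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and signatures are not taken from Spark.
class RetagSketch {
    // Stand-in for an RDD: the data plus a runtime element-class token.
    static final class Tagged<T> {
        private final List<T> items;
        private final Class<?> tag;

        Tagged(List<T> items, Class<?> tag) {
            this.items = items;
            this.tag = tag;
        }

        // Wraps the same data and overrides only the tag; nothing is copied.
        Tagged<T> retag(Class<T> correctTag) {
            return new Tagged<>(items, correctTag);
        }

        // Allocates the result array from the tag, so a wrong tag yields
        // an array of the wrong runtime type.
        @SuppressWarnings("unchecked")
        T[] collect() {
            Object[] out = (Object[]) Array.newInstance(tag, items.size());
            for (int i = 0; i < out.length; i++) {
                out[i] = items.get(i);
            }
            return (T[]) out;
        }
    }

    // The caller knows the element type, so it can repair the tag
    // before collecting.
    static String[] collectRepaired(List<String> data) {
        Tagged<String> fake = new Tagged<>(data, Object.class); // fake tag
        return fake.retag(String.class).collect();
    }

    public static void main(String[] args) {
        String[] result = collectRepaired(Arrays.asList("a", "b"));
        System.out.println(Arrays.toString(result));
    }
}
```

The repair is only safe because the caller supplies a class it knows to be correct; retagging with the wrong class would fail when the elements are stored into the array.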

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-2737

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1639
    
----
commit eb1c7feec65a04ffc9d343ea74c19137161ce44f
Author: Josh Rosen <[email protected]>
Date:   2014-07-29T21:16:47Z

    [SPARK-2737] Add retag() method for changing RDDs' ClassTags.
    
    The Java API's use of fake ClassTags doesn't seem to cause any problems for
    Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs
    to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect()
    on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions
    when we try to allocate an array of the wrong type (for example, see
    SPARK-2197).

    There are a few possible fixes here. An API-breaking fix would be to
    completely remove the fake ClassTags and require Java API users to pass
    java.lang.Class instances to all parallelize() calls and add returnClass
    fields to all Function implementations. This would be extremely verbose.

    Instead, this patch adds internal APIs to "repair" a Scala RDD with an
    incorrect ClassTag by wrapping it and overriding its ClassTag. This should
    be okay for cases where the Scala code that calls collect() knows what type
    of array should be allocated, which is the case in the MLlib wrappers.

----


