GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/14725

    [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays

    ## What changes were proposed in this pull request?
    Adding convenience functions to the Python `JavaWrapper` so that it is
    easy to create a py4j JavaArray that is compatible with existing class
    constructors that take a Scala `Array` as input.
    
    Two functions are added here: one for primitive data types, which checks
    the type of the Python list and automatically creates the right JavaArray
    type, and one that takes the Java class as input so that custom classes
    can be made into a JavaArray.
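
The two helpers described above could look roughly like the sketch below. This is not the code from the patch: the function names, the type mapping, and the error handling are assumptions made for illustration. The only external call it relies on is py4j's `JavaGateway.new_array(java_class, *dimensions)`, reached through a `gateway` argument so that the logic can be exercised without a running JVM.

```python
# Hypothetical sketch; names below are illustrative, not taken from the patch.

# Assumed mapping from Python element types to Java array element types.
_PRIMITIVE_JAVA_TYPES = {
    bool: "boolean",
    int: "int",
    float: "double",
    str: "java.lang.String",
}


def infer_java_type_name(pylist):
    """Return the assumed Java element type name for a homogeneous list."""
    if not pylist:
        raise ValueError("cannot infer an element type from an empty list")
    first_type = type(pylist[0])
    # Exact type comparison so that True is not silently treated as an int.
    if any(type(item) is not first_type for item in pylist):
        raise TypeError("all elements must have the same type")
    try:
        return _PRIMITIVE_JAVA_TYPES[first_type]
    except KeyError:
        raise TypeError("unsupported element type: %s" % first_type.__name__)


def new_java_array(gateway, pylist, java_class):
    """Copy pylist into a new Java array of java_class via a py4j gateway."""
    # py4j's JavaGateway.new_array(java_class, *dimensions) allocates the array.
    java_array = gateway.new_array(java_class, len(pylist))
    for i, item in enumerate(pylist):
        java_array[i] = item
    return java_array
```

With a live gateway, `java_class` would be a py4j class reference such as `gateway.jvm.java.lang.String`; how the type name returned by `infer_java_type_name` is resolved to such a reference is left out here and would depend on the actual implementation.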
    
    Usage in actual ML classes would be similar to the following:
    ```
    class CountVectorizerModel():
        def __init__(self, vocab):
            jvocab = CountVectorizerModel._new_java_primitive_array(vocab)
            model = CountVectorizerModel._create_from_java_class(
              "org.apache.spark.ml.feature.CountVectorizerModel", jvocab)
            return model
    ...
    cvm = CountVectorizerModel(["a", "b", "c"])
    ```
    
    ## How was this patch tested?
    Added unit tests for new functionality and tested constructing a
    CountVectorizerModel from a list of vocab strings.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark pyspark-new_java_array-CountVectorizer-SPARK-17161

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14725
    
----
commit 2a8de605f0dfe1a1baf3748602ad10c06476198d
Author: Bryan Cutler <[email protected]>
Date:   2016-07-14T17:16:03Z

    testing out _new_java_array

commit 97bff0753f8b94ead97d68206268b5ba58abab6c
Author: Bryan Cutler <[email protected]>
Date:   2016-08-19T17:21:39Z

    Merge remote-tracking branch 'upstream/master' into wip-pyspark-new_java_array-CountVectorizer

commit 4766cdcdd6bd10e9e48212c1513dceb6684663c2
Author: Bryan Cutler <[email protected]>
Date:   2016-08-19T23:14:48Z

    undo changes to CountVectorizerModel used for testing

commit 1c0ddb92e32470e77fe2b7cfa675eb1c908bc713
Author: Bryan Cutler <[email protected]>
Date:   2016-08-19T23:15:56Z

    added convienience functions to JavaWrapper to create py4j JavaArray

commit f9672bfe34b1b5f5ea14700d2aaaee055f5323f8
Author: Bryan Cutler <[email protected]>
Date:   2016-08-19T23:20:16Z

    fixed style checks and tests

----


