GitHub user freeman-lab opened a pull request:

    https://github.com/apache/spark/pull/3902

    [SPARK-5089][PYSPARK][MLLIB] Fix vector convert

    This is a small change addressing a potentially significant bug in how 
PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to 
`DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark 
should upcast to float64, but this was not actually happening. As a result, 
non-float64 arrays were silently misparsed during SerDe, yielding erroneous 
results when running, for example, KMeans.
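
A minimal sketch of the failure mode, using plain numpy rather than the actual PySpark code. The helper name `to_float64` is hypothetical, and the assumption that the misparse amounts to reading raw non-float64 bytes as float64 is only one plausible illustration of what "parsed inappropriately" could look like:

```python
import numpy as np

def to_float64(values):
    """Hypothetical helper: return the input as a float64 numpy array."""
    arr = np.asarray(values)
    if arr.dtype != np.float64:
        # astype returns a new array; the result has to be kept and
        # returned, otherwise the caller still holds the original dtype.
        arr = arr.astype(np.float64)
    return arr

ints = np.array([1, 2, 3], dtype=np.int64)

# Assumption for illustration: a dtype-unaware reader interprets the raw
# int64 bytes as float64, producing values that are not 1.0, 2.0, 3.0.
misread = np.frombuffer(ints.tobytes(), dtype=np.float64)

# Upcasting first preserves the numeric values.
upcast = to_float64(ints)
```

Comparing `misread` with `upcast` shows why a clustering algorithm such as KMeans could return wrong results without raising any error.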
    
    The PR includes the fix, as well as a new test for the correct conversion 
behavior.
    
    @davies

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/freeman-lab/spark fix-vector-convert

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3902
    
----
commit 704f97ec9f003b4fc99e7a4274ece7a654577eee
Author: freeman <[email protected]>
Date:   2015-01-05T19:02:01Z

    Return array after changing type

commit 764db47a72ca2727772a444bfd7251b05c7dcf16
Author: freeman <[email protected]>
Date:   2015-01-05T19:04:59Z

    Add a test for proper conversion behavior

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
