GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/3070

    [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD

    Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala
    and Python. With this PR, we can easily map an RDD[LabeledPoint] to a
    SchemaRDD, and then select columns or save to a Parquet file. Examples
    in Scala and Python are attached. The Scala code was copied from
    @jkbradley.
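
    To illustrate the idea of registering a vector as a UDT, here is a minimal, Spark-free sketch of the serialize/deserialize round trip such a type needs. It is not the code in this PR: the class names `DenseVector`, `SparseVector`, and `VectorUDTSketch`, and the row layout (type tag, size, indices, values), are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DenseVector:
    """Hypothetical stand-in for a dense vector: every entry is stored."""
    values: List[float]


@dataclass
class SparseVector:
    """Hypothetical stand-in for a sparse vector: only non-zeros are stored."""
    size: int
    indices: List[int]
    values: List[float]


# Assumed SQL-friendly row layout: (type tag, size, indices, values).
# Tag 0 marks a sparse vector and tag 1 a dense one (an assumption here).
Row = Tuple[int, Optional[int], Optional[List[int]], List[float]]


class VectorUDTSketch:
    """Illustrative UDT: converts vectors to plain rows and back."""

    def serialize(self, vector) -> Row:
        # Flatten the vector into primitive fields a SQL engine can store.
        if isinstance(vector, SparseVector):
            return (0, vector.size, list(vector.indices), list(vector.values))
        if isinstance(vector, DenseVector):
            return (1, None, None, list(vector.values))
        raise TypeError("cannot serialize %r" % type(vector))

    def deserialize(self, row: Row):
        # Rebuild the original vector object from the stored fields.
        tag, size, indices, values = row
        if tag == 0:
            return SparseVector(size, list(indices), list(values))
        if tag == 1:
            return DenseVector(list(values))
        raise ValueError("unknown vector type tag: %r" % tag)


if __name__ == "__main__":
    udt = VectorUDTSketch()
    row = udt.serialize(SparseVector(3, [0, 2], [1.0, 3.0]))
    print(row)
    print(udt.deserialize(row))
```

    Using a single row layout with a type tag lets dense and sparse vectors share one column type, which is what makes selecting the column or writing it to a columnar file straightforward in this sketch.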
    
    This PR contains the changes from #3068. I will rebase after #3068 is
    merged.
    
    @marmbrus @jkbradley

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-3573

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3070
    
----
commit b7f666d239ee4d0e242cffbce192ea62c0847c2b
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T03:12:31Z

    add Python UDT

commit 39f19e097aea1faa66b4d0f7a8a46646595292d6
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:15:16Z

    add tests

commit 4e84fcee485ff1968da1859c831a6897044fa3b2
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:15:46Z

    remove local hive tests and add more tests

commit e98d9d0d23ec9c2ec52cf0c6b7102315dde8c4e1
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:41:54Z

    fix py style

commit f740379371ec0c54bd6b8e0ecd2b730673a36d64
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:49:24Z

    remove UDT from default imports

commit 75223db28c64e937640826b55b62cda147ac59dd
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:51:15Z

    minor update

commit 2c986ec470be240e5eab3d887c907b75546fad36
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T04:57:03Z

    make mllib depend on sql

commit 4e22b51183dae9ec5122441b12930074c40d1f89
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T05:27:15Z

    add VectorUDT in Scala

commit 7c4a6a9ea84bb1d98d0718c254b70e8b36f912f5
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T05:43:46Z

    address comments

commit 3cadde9abfc9668a99bf67a09e39ae28fbcf0931
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T06:02:18Z

    Merge branch 'SPARK-4192-sql' into SPARK-3573

commit c0ca84a17987ad4e27b6af9b3a7ad71b3a609a75
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T06:25:59Z

    add VectorUDT in Python

commit 94762f8cb35d882149529c0f5378a05b12953a37
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T06:59:16Z

    copy Scala's DatasetExample from jkbradley

commit e8a5763d639e8d54936435abdbd798eb604d1b4d
Author: Xiangrui Meng <[email protected]>
Date:   2014-11-03T06:59:37Z

    add Python's DatasetExample

----

