GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/2919
[SPARK-3572] [sql] [mllib] User-Defined Types and MLlib Datasets
This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using
SchemaRDD as a Dataset for the new MLlib API.
## Main additions
Public API
* SQL
  * Added annotation SQLUserDefinedType (DeveloperApi)
  * Added UDTRegistry (global object)
  * Added abstract class UserDefinedType
* MLlib
  * Vector, DenseVector, SparseVector are annotated with SQLUserDefinedType

Internals
* Made MLlib depend on SparkSQL.
* SQL
  * ScalaReflection
    * Methods for converting between Scala and Catalyst types now take DataType.
    * convertRowToScala added in several locations in SQL
    * schemaFor checks for SQLUserDefinedType annotation and checks UDTRegistry
* MLlib
  * Added VectorUDT, DenseVectorUDT, SparseVectorUDT (private[spark])

Examples
* /examples/mllib/DatasetExample.scala: Demonstrates implicit conversion of RDD[LabeledPoint] to SchemaRDD

Unit Tests
* mllib/rdd/DatasetSuite.scala: Tests *VectorUDT
* sql/UserDefinedTypeSuite.scala: Tests fake version of DenseVector
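
To make the shape of the additions listed above more concrete, here is a minimal, self-contained sketch of what a user-defined type and a global registry could look like. It does not use Spark: `DataType`, `UserDefinedType`, `DenseVec`, `DenseVecUDT`, and this `UDTRegistry` are hypothetical stand-ins written for illustration, and the method names (`sqlType`, `serialize`, `deserialize`, `register`, `lookup`) are assumptions, not the signatures introduced by this PR.

```scala
// Hypothetical stand-ins for illustration only; NOT Spark's classes or signatures.
sealed trait DataType
case object DoubleType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

// A user-defined type maps a user class to a SQL-native representation.
abstract class UserDefinedType[T] {
  def sqlType: DataType            // how values are stored internally
  def serialize(obj: T): Any       // user object -> SQL-native value
  def deserialize(datum: Any): T   // SQL-native value -> user object
}

// Toy vector class standing in for a dense vector.
final case class DenseVec(values: Array[Double])

// Stores a DenseVec as a sequence of doubles.
object DenseVecUDT extends UserDefinedType[DenseVec] {
  val sqlType: DataType = ArrayType(DoubleType)

  def serialize(obj: DenseVec): Any = obj.values.toSeq

  def deserialize(datum: Any): DenseVec = datum match {
    case xs: Seq[_] => DenseVec(xs.map(_.asInstanceOf[Double]).toArray)
    case other      => throw new IllegalArgumentException(s"Cannot deserialize: $other")
  }
}

// Global registry keyed by class (assumed design, loosely mirroring "UDTRegistry").
object UDTRegistry {
  private val udts =
    scala.collection.mutable.Map.empty[Class[_], UserDefinedType[_]]

  def register[T](cls: Class[T], udt: UserDefinedType[T]): Unit =
    udts.synchronized { udts.update(cls, udt) }

  def lookup(cls: Class[_]): Option[UserDefinedType[_]] =
    udts.synchronized { udts.get(cls) }
}
```

In this sketch, a caller would register `DenseVecUDT` for `classOf[DenseVec]` once, and schema inference could then consult `UDTRegistry.lookup` when it encounters that class. The annotation-based route described in the PR would attach the same information to the class itself instead.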
## Design decisions
* A UDT overrides the type that SQL would natively recognize for the same class.
  * Question: Should users be able to override primitive or built-in types?
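
The override behaviour and the open question can be illustrated with a toy lookup in which user registrations are consulted before the built-in mapping. This is only a sketch under stated assumptions: `SchemaLookup`, `registerUDT`, this `schemaFor`, and the string type names are hypothetical and do not reflect how ScalaReflection resolves types in the PR.

```scala
// Hypothetical sketch; names and the string-based type model are illustrative only.
object SchemaLookup {
  // Types "natively recognized" in this toy model.
  private val builtIn: Map[Class[_], String] = Map(
    classOf[Int]    -> "IntegerType",
    classOf[String] -> "StringType"
  )

  // User registrations; consulted first, so they take precedence.
  private val userDefined =
    scala.collection.mutable.Map.empty[Class[_], String]

  def registerUDT(cls: Class[_], typeName: String): Unit =
    userDefined.update(cls, typeName)

  // A registered UDT shadows the built-in entry for the same class.
  def schemaFor(cls: Class[_]): Option[String] =
    userDefined.get(cls).orElse(builtIn.get(cls))
}
```

With this ordering, registering a UDT for `classOf[String]` silently changes how every `String` field is resolved, which is the kind of consequence the question above is asking about. Reversing the `orElse` order would instead protect built-in types from being overridden.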
## Items left for future PRs
* Java and Python APIs
* Serialization (Parquet, etc.)
CC: @mengxr @marmbrus
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark sql-udt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2919
----
commit 7060cdd719a81b243f82f7e21a165a492daf79a1
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-03T02:06:49Z
Adding UserDefinedType to SQL, not done yet.
commit 48d644de752df6e34b630991a296d46cb8049247
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-03T02:20:01Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 5b6612848e30b9c415d46c93306f8cdacdc87ea7
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-03T19:47:29Z
Merge remote-tracking branch 'upstream/master' into sql-udt
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
commit 3d94153b85b972a2aace5296eb6e11dceecbba8e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-04T01:04:32Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit b9df66e6fc7dff838065212be5277406f562d6c6
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-06T16:54:51Z
Still working on UDTs
commit 1dc68146fc85dc32abfe1fb389a263c48bcd7c3f
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-06T17:06:03Z
Merge remote-tracking branch 'upstream/master' into sql-udt
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
commit 92891b9ac2a11e80417aae371564f1b50725677e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-06T20:18:26Z
still working on UDTs
commit f91e6afd522b615c69f2a2bc815975044e3daa34
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-07T02:10:43Z
still working on UDTs
commit d4b3209836c18dd6bacb13e6dfeeb90e6a53bc58
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-07T02:15:25Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 2f835b78fbd7f49e80aceb6000292df8e9de4b54
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-07T19:25:45Z
more udts...
commit 283a8aaf6a77496b7a0ff8d0c2a4fea9429924cc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-07T22:51:07Z
commented out convertRowToScala for debugging
commit 521eb945357805992040da917ed89b73f84fd089
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-07T23:43:30Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 86815d1600c776b1d4cdc3d93c748729afd635ac
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-08T02:22:10Z
basic UDT is working, but deserialization has yet to be done
commit 007c84fb7ba205083ba72086733219a9e5aa88f4
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-08T02:31:59Z
removed old udt suite
commit 8aa3b20f825eb0469c6e13cb67d25047a58ddb43
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T03:33:06Z
Merge remote-tracking branch 'upstream/master' into sql-udt
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
commit 19ae3f6a72b9b715360efa76a03a142e71d8c6be
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T19:39:44Z
udts
commit ef010b553a2500758dedc19d8ef47c8db7d21ee9
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T19:39:49Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 8b2222fab864047ce88860f9180b1fe6f9fd8258
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T20:09:15Z
udts
commit f02b01def0731d956fd708df737c6043bb68b019
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T21:18:41Z
udt finally working
commit ceb886e6a1ad52e45d85eb76da28c9b172d7193c
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T21:48:33Z
some cleanups
commit 47de90af0ab29cbaa712c6a4e13d312edd265108
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-09T21:56:09Z
more cleanups
commit b8d0adeb9fbd6ed74a67bdfd8725c73d104aa86a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T17:33:37Z
Changing UDT to annotation
commit 7fae92842b3e9786a2bd6fec04b307a7023ad837
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T18:53:27Z
udt annotation now working
commit 530022eb1f0243e38cf49a67df62d60c38cf975f
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T18:53:37Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 77a03056f048f070eb288c27ca42da2fd57a72b1
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T20:13:35Z
renamed UDT types
commit db093877d728e4d76e063eb63c1c91c938a3a63b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T21:02:32Z
blah
commit 494347741ec949c499faaf7baa130d12fd988d93
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-10T22:14:29Z
Added MLlib dependency on SQL.
commit df1e069ed08eb55cc5b03388f784f97fc481d492
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-17T18:58:01Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit d41a5963e81337f80c9f7286b161a9fe18257e15
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-18T00:24:10Z
Merge remote-tracking branch 'upstream/master' into sql-udt
commit 24b054bca6bb43b236835ed7d9848064c5d5d130
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-20T19:39:42Z
Merge remote-tracking branch 'upstream/master' into sql-udt
----