GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/1759
[SPARK-2816][SQL] Type-safe SQL Queries
**This is an experimental feature of Spark SQL and is intended primarily to
get feedback from users. APIs may change in future versions.**
This PR adds a string interpolator that allows users to run Spark SQL
queries that return type-safe results in Scala. SQL interpolation is
invoked by prefixing a string literal with `sql`, and supports including
RDDs using `$`. For example:
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(firstName: String, lastName: String, age: Int)
val people = sc.makeRDD(Person("Michael", "Armbrust", 30) :: Nil)
val michaels = sql"SELECT * FROM $people WHERE firstName = 'Michael'"
```
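For readers unfamiliar with custom interpolators: Scala rewrites `sql"..."` into a method call on a `StringContext` that holds the literal fragments, with the `$` expressions passed as arguments. The sketch below shows only that general mechanism using a plain, non-macro interpolator. The name `sqlText` and the `table0` placeholder scheme are hypothetical and are not the implementation in this PR, and an ordinary `List` stands in for an RDD.

```scala
object InterpolatorSketch {
  // Hypothetical interpolator, NOT the one added by this PR: it only records
  // the query text and the interpolated values instead of running a query.
  implicit class SqlTextInterpolator(val sc: StringContext) {
    def sqlText(args: Any*): (String, Seq[Any]) = {
      // sc.parts holds the literal fragments; there is one more part than args.
      val query = sc.parts.zipWithIndex.map { case (part, i) =>
        // Stand in for each interpolated value with a generated table name.
        if (i < args.length) part + "table" + i else part
      }.mkString
      (query, args)
    }
  }

  // A plain List stands in for an RDD here.
  val people = List("Michael")

  // The compiler rewrites this literal into a call of the form
  // StringContext(<fragments>).sqlText(people).
  val (query, captured) = sqlText"SELECT * FROM $people WHERE firstName = 'Michael'"
}
```

A real type-safe interpolator would additionally need the static type of each interpolated value at compile time, which is consistent with the limitation on case classes listed under known limitations.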
The result RDDs of interpolated SQL queries contain [Scala
records](https://github.com/scala-records/scala-records) that have been
*refined* with the output schema of the query. This refinement means that you
can access the columns of the result as you would normal fields of objects in
Scala, and that these fields will return the correct type. Continuing the
previous example:
```scala
assert(michaels.first().firstName == "Michael")
```
You can also use interpolation to include lambda functions that are in
scope as UDFs:
```scala
import java.util.Calendar
val birthYear = (age: Int) => Calendar.getInstance().get(Calendar.YEAR) - age
val years = sql"SELECT $birthYear(age) FROM $people"
```
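Because the UDF is an ordinary Scala function value, it can be checked on its own before being interpolated into a query. A minimal sketch, assuming a hypothetical `birthYearIn` variant that takes the reference year as a parameter so the result is deterministic:

```scala
import java.util.Calendar

object UdfSketch {
  // Hypothetical helper (not part of this PR): the reference year is a
  // parameter so the function can be tested without depending on the clock.
  val birthYearIn: Int => Int => Int = currentYear => age => currentYear - age

  // Same behaviour as the lambda above: the current year is read on each call.
  val birthYear: Int => Int =
    age => birthYearIn(Calendar.getInstance().get(Calendar.YEAR))(age)

  // Someone aged 30 in 2014.
  val example: Int = birthYearIn(2014)(30)
}
```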
Results can also be refined into existing case class types when the names
of the columns match up with the arguments to the class's constructor.
```scala
import org.apache.spark.rdd.RDD
case class Employee(name: String, birthYear: Int)
val employees: RDD[Employee] =
  sql"SELECT lastName AS name, $birthYear(age) AS birthYear FROM $people".map(_.to[Employee])
```
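To make the intent of that query concrete, here is a hand-written equivalent of the projection, with plain Scala collections standing in for RDDs. It only illustrates what the query computes, not how the interpolator or `to[Employee]` is implemented. The fixed year 2014 is an assumption used to keep the result deterministic.

```scala
object RefinementSketch {
  case class Person(firstName: String, lastName: String, age: Int)
  case class Employee(name: String, birthYear: Int)

  // Fixed reference year (an assumption for this sketch only).
  val birthYear = (age: Int) => 2014 - age

  // A List stands in for the RDD of people.
  val people = List(Person("Michael", "Armbrust", 30))

  // SELECT lastName AS name, birthYear(age) AS birthYear FROM people:
  // the column aliases line up with Employee's constructor parameters.
  val employees: List[Employee] =
    people.map(p => Employee(p.lastName, birthYear(p.age)))
}
```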
Known limitations:
- SQL interpolation only works when the included RDDs are of case
classes and the type of the case class can be determined statically at compile
time.
- Null values for primitive columns will raise an exception.
- Escapes in strings may not be handled correctly.
- Does not work with triple-quoted (`"""`) strings or strings containing new lines.
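The null limitation is consistent with a general property of Scala: a primitive field such as `Int` cannot hold `null`, so unboxing a missing value fails. The sketch below shows that behaviour with a boxed `java.lang.Integer`, plus one possible way to model missing values with `Option`. Where exactly the interpolator raises its exception is not stated here, and `NullSketch` and its members are hypothetical names.

```scala
import scala.util.Try

object NullSketch {
  // A boxed value representing a SQL NULL in an integer column.
  val boxedAge: java.lang.Integer = null

  // Unboxing null to a primitive Int throws a NullPointerException.
  val unboxed: Try[Int] = Try(boxedAge.intValue())

  // Option makes the missing value explicit instead of failing.
  val safeAge: Option[Int] = Option(boxedAge).map(_.intValue())
}
```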
Thanks to @gzm0 @vjovanov @hubertp @densh for Scala records and @ahirreddy
for the initial work on the interpolator.
TODO:
- [ ] Maven build
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark typedSql
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1759.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1759
----
commit 2fd5a85d1af4566cb1f9505c2a722aecc8468a11
Author: Michael Armbrust <[email protected]>
Date: 2014-07-07T05:40:35Z
WIP: Typed SQL queries
commit 457d699e6f8d16c07e743e5e35a37dbe0e24f30d
Author: Michael Armbrust <[email protected]>
Date: 2014-07-07T07:20:06Z
Now with more than one relation.
commit c6c60e38cd7bb8b8878bf1e010c910e88bb372c5
Author: Tobias Schlatter <[email protected]>
Date: 2014-07-08T11:41:57Z
Remove intermediate map for records. Allow serialization
commit 3d4ce6729dbae4c9206ee225792306be25531f0c
Author: Michael Armbrust <[email protected]>
Date: 2014-07-11T23:00:24Z
Merge pull request #4 from gzm0/typedSql
Remove intermediate map for records. Allow serialization
commit 157d242465cfa945842c7e268f96d250e28962fa
Author: Tobias Schlatter <[email protected]>
Date: 2014-07-17T11:37:42Z
Add specialization to record implementation
commit 24f8d1690990e45d1add37febe4c0ab661075b46
Author: Michael Armbrust <[email protected]>
Date: 2014-07-22T01:19:32Z
Merge pull request #5 from gzm0/typedSql
Add specialization to record implementation
commit ac067cb4f7577f9fed145a42974b1e2e6e51d14d
Author: Michael Armbrust <[email protected]>
Date: 2014-07-22T01:39:26Z
Add nested test case
commit 83dd0928db6d1109a9290dd14e7208c90ee75c60
Author: Tobias Schlatter <[email protected]>
Date: 2014-07-22T12:19:59Z
Fix records version to 0.1
commit ae5ecaf56fe2dab90327635dcc58e59ab236bb4d
Author: Tobias Schlatter <[email protected]>
Date: 2014-07-22T12:20:53Z
Handle nested fields
commit b38fef3b7520d68668c235922ed229d2e0a5b20f
Author: Tobias Schlatter <[email protected]>
Date: 2014-07-22T12:31:08Z
Refactor ScalaReflection to support compile-time reflection
commit 49be122631468b858b5ba131c0b3c5fc51c05db3
Author: Michael Armbrust <[email protected]>
Date: 2014-07-24T02:09:42Z
Merge pull request #7 from gzm0/refactor-scala-reflection
Refactor scala reflection
commit e4f8c49eaa52142bcbd9158253cd30f55b04f323
Author: Michael Armbrust <[email protected]>
Date: 2014-08-03T22:44:35Z
Merge remote-tracking branch 'origin/master' into typedSql
Conflicts:
project/SparkBuild.scala
sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
commit 4d62fb58fac6bafd1ee65bea62d843b6f2106350
Author: Michael Armbrust <[email protected]>
Date: 2014-08-03T23:28:26Z
Merge remote-tracking branch 'marmbrus/typedSql' into typedSql
Conflicts:
project/SparkBuild.scala
commit d64c860df8e2fe8b7d14190ebd160c2c1d312c88
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T01:24:10Z
Add udf support.
commit 5b3ab551c9eb804971b24f9291504f1d5703223c
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T01:38:49Z
Docs and private.
commit ce7dd36f1fcbde43bd22c5fd12006059d32ae5f0
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T01:40:53Z
Merge remote-tracking branch 'origin/master' into typedSql
Conflicts:
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
commit 760466a4d515fb2cec37e47f0943c32a5d274c7a
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T01:41:24Z
spurious change
commit 2b73b47653f350ca9911243f8b23c1218512660e
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T02:52:45Z
some quote handling, case sensitivity
commit ca471036a48a6146bbf276d58714db9e511ddfb1
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T02:53:58Z
printlns
commit 6e33920c4c046a80195ee168f0d0d731fc8306af
Author: Michael Armbrust <[email protected]>
Date: 2014-08-04T03:06:21Z
formatting
----