GitHub user ahirreddy opened a pull request:
https://github.com/apache/spark/pull/363
PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD
composed of dictionaries with string keys and primitive values (boolean,
float, int, long, string) can be converted into a SchemaRDD that supports SQL
queries.
```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1": 1, "field2": "row1"},
                      {"field1": 2, "field2": "row2"},
                      {"field1": 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
srdd2.collect()
```
The last line yields:
```
[{"f1": 1, "f2": "row1"}, {"f1": 2, "f2": "row2"}, {"f1": 3, "f2": "row3"}]
```
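To make the input constraint concrete, here is a minimal pure-Python sketch of the kind of check a dictionary row would need to pass: string keys and primitive values only. It is not the code in this pull request. `infer_schema` and `ALLOWED_TYPES` are hypothetical names used only for illustration, and the sketch assumes Python 3, where `int` also covers what Python 2 calls `long`.

```python
# Hypothetical helper, NOT part of this pull request: plain Python, no Spark required.
# Assumes Python 3, where `int` covers both int and long.
ALLOWED_TYPES = (bool, int, float, str)


def infer_schema(row):
    """Map each field name to the name of its Python type.

    Raises TypeError if a key is not a string or a value is not a primitive.
    """
    schema = {}
    for key, value in row.items():
        if not isinstance(key, str):
            raise TypeError("field names must be strings, got %r" % (key,))
        if not isinstance(value, ALLOWED_TYPES):
            raise TypeError(
                "unsupported value type for %r: %s" % (key, type(value).__name__)
            )
        schema[key] = type(value).__name__
    return schema


rows = [{"field1": 1, "field2": "row1"}, {"field1": 2, "field2": "row2"}]
schema = infer_schema(rows[0])
print(schema)
```

A row with a nested value, such as `{"field1": [1, 2]}`, is rejected with a `TypeError` in this sketch. How the actual API reacts to such input is not stated in this message.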
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ahirreddy/spark pysql
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/363.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #363
----
commit b4bc82d2072e0ddb2204a04404d7afa2b2263aa9
Author: Ahir Reddy <[email protected]>
Date: 2014-04-06T22:00:47Z
compiling
commit b6f4feb3c4917f463d2f54647dd2781a20fc63bc
Author: Ahir Reddy <[email protected]>
Date: 2014-04-06T22:03:59Z
Java to python
commit 5cb8dc05a03a74f37f6cdaff165b3f1d1a94c1db
Author: Ahir Reddy <[email protected]>
Date: 2014-04-06T23:07:10Z
java to python, and python to java
commit d2c60af513afca5aec0292316c9c0516de66927f
Author: Ahir Reddy <[email protected]>
Date: 2014-04-07T04:41:09Z
Added schema rdd class
commit 949071bfd269f0ac608bfa470474c91cae97f91f
Author: Ahir Reddy <[email protected]>
Date: 2014-04-07T22:45:55Z
doesn't crash
commit 9cb15c858dbacfe6156c7289575d0d1baa5a986c
Author: Ahir Reddy <[email protected]>
Date: 2014-04-07T23:09:22Z
working
commit 730803e0843a3497d4bdf663a86363b33a8883c2
Author: Ahir Reddy <[email protected]>
Date: 2014-04-07T23:47:36Z
more working
commit 837bd13bfa2e757ca6cdbe79af1ae00cba7749f0
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T00:16:57Z
even better
commit 224add86bf0ca3af5c478d8189103463f2ed9918
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T01:26:48Z
yippie
commit f16524d873d5b7e1f881d1d2bab66a88f9193bd7
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T04:25:14Z
Switched to using Scala SQLContext
commit d69594dca922f87aa4ac05c4ab0b59a47eb12e5b
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T05:11:23Z
returning dictionaries works
commit 337ed16ea5d30fc9e51415607cde2f24219c5624
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T05:17:48Z
output dictionaries correctly
commit ed9e3b447f0114e8bbe02166fb61d7965a4eb641
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T05:25:26Z
return row objects
commit 2d44498d9932821437fc3c0794eafe357591d86d
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T05:42:19Z
awesome row objects
commit 1f6e3436291572bbcda267179b320daa939e7b8e
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T06:13:01Z
SchemaRDD now has all RDD operations
commit ef91795554afd59c5fefa61721df62354094b92d
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T06:19:05Z
made jrdd explicitly lazy
commit ec5b6e63782f3d181546bcd00bfb05e039a52b1d
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T06:33:52Z
for now only allow dictionaries as input
commit 6c690e590e214c6f4e4e7f28eaedb30874df5ec6
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T06:36:53Z
added todo explaining cost of creating Row object in python
commit 7e270b49a042e3e5f98ac030f3371eac838a716a
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T19:01:06Z
adding tests
commit 90ab8f5365df1a1db98dbf7f7ae00f7c1ae4fa6f
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T19:32:12Z
added test
commit 6417b7cbd99b710fa7b25d6858f843b3582e95c2
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T19:52:40Z
added more tests :)
commit be5734e3ae0ff589bf5a25c02fbc819eaf0c0a1e
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T20:20:08Z
added more tests
commit 22413b350e14d5d5103be3d8dce07660f86283fd
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T22:29:55Z
Added pyrolite dependency
commit 3e874c6fca86e0c407b88d1cd861b1f1ba1fe685
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T22:43:32Z
Added tests and documentation
commit 068ff77e84b5f8d72d32a1969f8f116d3cbd9f09
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T22:48:30Z
doctest formatting
commit 052b4b70a2909cbb0b6fc1e2c61d066765fccc11
Author: Ahir Reddy <[email protected]>
Date: 2014-04-08T22:53:04Z
cleaning up cruft
commit 08580e1f1001cc361a63ac9122ac4cb86f0abaff
Author: Ahir Reddy <[email protected]>
Date: 2014-04-09T01:01:31Z
HiveContexts
commit 83a0cc6c9690c6bbd133d2e9e2b284c6d03ab0da
Author: Ahir Reddy <[email protected]>
Date: 2014-04-09T02:08:36Z
Added Long, Double and Boolean as usable types + unit test
----