[GitHub] spark pull request: PySpark API for SparkSQL

ahirreddy Tue, 08 Apr 2014 19:23:21 -0700

GitHub user ahirreddy opened a pull request:

    https://github.com/apache/spark/pull/363


    PySpark API for SparkSQL

    An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD 
composed of dictionaries with string keys and primitive values (boolean, 
float, int, long, string) can be converted into a SchemaRDD that supports SQL 
queries.
    
    ```
    from pyspark.context import SQLContext
    # sc is an existing SparkContext
    sqlCtx = SQLContext(sc)
    # Each record is a dictionary with string keys and primitive values
    rdd = sc.parallelize([{"field1": 1, "field2": "row1"},
                          {"field1": 2, "field2": "row2"},
                          {"field1": 3, "field2": "row3"}])
    # Convert to a SchemaRDD and register it so SQL can refer to it by name
    srdd = sqlCtx.applySchema(rdd)
    sqlCtx.registerRDDAsTable(srdd, "table1")
    srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
    srdd2.collect()
    ```
    The last line yields ```[{"f1": 1, "f2": "row1"}, {"f1": 2, "f2": "row2"}, 
{"f1": 3, "f2": "row3"}]```

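As an illustration of the input constraint described above (dictionaries with string keys and primitive values), here is a minimal sketch in plain Python that needs no Spark installation. The names `infer_schema` and `PRIMITIVE_TYPES` are hypothetical and introduced only for this example; they are not part of the pull request, and the check is an assumption about what valid input looks like, not the patch's actual conversion logic.

```python
# Hypothetical helper, NOT part of the pull request: it only mirrors the
# input rule stated in the description. Written for Python 3, where the
# separate `long` type of Python 2 no longer exists.
PRIMITIVE_TYPES = (bool, float, int, str)


def infer_schema(record):
    """Return a {field name: type name} mapping for one dictionary record."""
    if not isinstance(record, dict):
        raise TypeError("record must be a dictionary, got %s"
                        % type(record).__name__)
    schema = {}
    for key, value in record.items():
        if not isinstance(key, str):
            raise TypeError("field names must be strings, got %r" % (key,))
        if not isinstance(value, PRIMITIVE_TYPES):
            raise TypeError("unsupported value type for %r: %s"
                            % (key, type(value).__name__))
        schema[key] = type(value).__name__
    return schema


rows = [{"field1": 1, "field2": "row1"}, {"field1": 2, "field2": "row2"}]
schemas = [infer_schema(row) for row in rows]
```

Under these assumptions every row in `rows` maps to the same schema, while a record holding a nested value such as a list would be rejected with a `TypeError`.
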
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ahirreddy/spark pysql

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/363.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #363
    
----
commit b4bc82d2072e0ddb2204a04404d7afa2b2263aa9
Author: Ahir Reddy <[email protected]>
Date:   2014-04-06T22:00:47Z

    compiling

commit b6f4feb3c4917f463d2f54647dd2781a20fc63bc
Author: Ahir Reddy <[email protected]>
Date:   2014-04-06T22:03:59Z

    Java to python

commit 5cb8dc05a03a74f37f6cdaff165b3f1d1a94c1db
Author: Ahir Reddy <[email protected]>
Date:   2014-04-06T23:07:10Z

    java to python, and python to java

commit d2c60af513afca5aec0292316c9c0516de66927f
Author: Ahir Reddy <[email protected]>
Date:   2014-04-07T04:41:09Z

    Added schema rdd class

commit 949071bfd269f0ac608bfa470474c91cae97f91f
Author: Ahir Reddy <[email protected]>
Date:   2014-04-07T22:45:55Z

    doesn't crash

commit 9cb15c858dbacfe6156c7289575d0d1baa5a986c
Author: Ahir Reddy <[email protected]>
Date:   2014-04-07T23:09:22Z

    working

commit 730803e0843a3497d4bdf663a86363b33a8883c2
Author: Ahir Reddy <[email protected]>
Date:   2014-04-07T23:47:36Z

    more working

commit 837bd13bfa2e757ca6cdbe79af1ae00cba7749f0
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T00:16:57Z

    even better

commit 224add86bf0ca3af5c478d8189103463f2ed9918
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T01:26:48Z

    yippie

commit f16524d873d5b7e1f881d1d2bab66a88f9193bd7
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T04:25:14Z

    Switched to using Scala SQLContext

commit d69594dca922f87aa4ac05c4ab0b59a47eb12e5b
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T05:11:23Z

    returning dictionaries works

commit 337ed16ea5d30fc9e51415607cde2f24219c5624
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T05:17:48Z

    output dictionaries correctly

commit ed9e3b447f0114e8bbe02166fb61d7965a4eb641
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T05:25:26Z

    return row objects

commit 2d44498d9932821437fc3c0794eafe357591d86d
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T05:42:19Z

    awesome row objects

commit 1f6e3436291572bbcda267179b320daa939e7b8e
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T06:13:01Z

    SchemaRDD now has all RDD operations

commit ef91795554afd59c5fefa61721df62354094b92d
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T06:19:05Z

    made jrdd explicitly lazy

commit ec5b6e63782f3d181546bcd00bfb05e039a52b1d
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T06:33:52Z

    for now only allow dictionaries as input

commit 6c690e590e214c6f4e4e7f28eaedb30874df5ec6
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T06:36:53Z

    added todo explaining cost of creating Row object in python

commit 7e270b49a042e3e5f98ac030f3371eac838a716a
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T19:01:06Z

    adding tests

commit 90ab8f5365df1a1db98dbf7f7ae00f7c1ae4fa6f
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T19:32:12Z

    added test

commit 6417b7cbd99b710fa7b25d6858f843b3582e95c2
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T19:52:40Z

    added more tests :)

commit be5734e3ae0ff589bf5a25c02fbc819eaf0c0a1e
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T20:20:08Z

    added more tests

commit 22413b350e14d5d5103be3d8dce07660f86283fd
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T22:29:55Z

    Added pyrolite dependency

commit 3e874c6fca86e0c407b88d1cd861b1f1ba1fe685
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T22:43:32Z

    Added tests and documentation

commit 068ff77e84b5f8d72d32a1969f8f116d3cbd9f09
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T22:48:30Z

    doctest formatting

commit 052b4b70a2909cbb0b6fc1e2c61d066765fccc11
Author: Ahir Reddy <[email protected]>
Date:   2014-04-08T22:53:04Z

    cleaning up cruft

commit 08580e1f1001cc361a63ac9122ac4cb86f0abaff
Author: Ahir Reddy <[email protected]>
Date:   2014-04-09T01:01:31Z

    HiveContexts

commit 83a0cc6c9690c6bbd133d2e9e2b284c6d03ab0da
Author: Ahir Reddy <[email protected]>
Date:   2014-04-09T02:08:36Z

    Added Long, Double and Boolean as usable types + unit test

----


