GitHub user felixcheung opened a pull request:
https://github.com/apache/spark/pull/13635
[SPARK-15159][SPARKR] SparkR SparkSession API
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
The new entry points are `sparkR.session.getOrCreate()` and `sparkR.session.stop()`.
"getOrCreate" is a bit unusual in R, but it is important to name this operation clearly.
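A minimal sketch of how the proposed entry points might be used (the parameter names here are illustrative, not confirmed signatures; running this requires a Spark installation):

```r
library(SparkR)

# New API: create a SparkSession (and underlying SparkContext),
# or return the existing one if a session is already running.
sparkR.session.getOrCreate(master = "local[*]", appName = "example")

# SparkR DataFrame operations can now be used without an explicit sqlContext.

# Tear the session down when done.
sparkR.session.stop()
```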
The SparkR implementation follows these principles:
- SparkSession is the main entrypoint (vs SparkContext; due to limited
functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around
SparkSession, and because of API changes, supporting all 3 would be a lot more
work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to
initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a
deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move
to sparkR.session.getOrCreate() easily
- An advanced syntax with named parameters (aka varargs, aka "...") is
supported; this should be closer to the Builder syntax in Scala/Python
(which unfortunately does not translate to R, because it would look like this:
`enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"),
"first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported; the behavior
matches Python, where the config is applied to both the SparkContext and the
SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because they
would be breaking API changes: the catalog object, createOrReplaceTempView
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`,
`tableNames`
- A bug in `read.jdbc` is fixed
- `sparkR` shell is updated to use the SparkSession entrypoint
(`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
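The named-parameter ("...") syntax described above could look like the following sketch; the option keys below are examples for illustration, not confirmed parameter names from this PR:

```r
library(SparkR)

# Hypothetical illustration of configuring a session via named parameters,
# in place of the chained Builder calls used in Scala/Python.
sparkR.session.getOrCreate(
  master = "local[*]",
  appName = "example",
  enableHiveSupport = TRUE,
  spark.executor.memory = "1g"  # arbitrary Spark conf keys passed as named args
)
```

This keeps the call a single flat function invocation, which reads more naturally in R than nested builder calls.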
TODO
- [ ] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding examples
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
Unit tests and manual tests.
@shivaram @sun-rui @rxin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/felixcheung/spark rsparksession
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13635.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13635
----
commit bac797171a6d08ff4589b3dc5b6d6f6f587e4f26
Author: felixcheung <[email protected]>
Date: 2016-05-30T19:26:30Z
[WIP] SparkSession in R
commit 972641c1c6a72c56778c5880d608b126f25e017d
Author: Felix Cheung <[email protected]>
Date: 2016-06-09T15:25:14Z
more changes for spark session
commit 152a25549fa5f20e140215a340a83000d81a5750
Author: Felix Cheung <[email protected]>
Date: 2016-06-11T11:16:10Z
fix tests
commit a938655587cdb6f3d42507b77653bc29ed4e663e
Author: Felix Cheung <[email protected]>
Date: 2016-06-13T01:29:45Z
add support for updating config for existing session, fix read.jdbc
commit b494232fc144c3428fb6aa4328c8579d233abb5c
Author: Felix Cheung <[email protected]>
Date: 2016-06-13T06:05:15Z
fix style, test
----