GitHub user shivaram opened a pull request:
https://github.com/apache/spark/pull/5096
[SPARK-5654] Integrate SparkR
This pull request integrates SparkR, an R frontend for Spark. The SparkR
package contains both RDD and DataFrame APIs in R and is integrated with
Spark's submission scripts to work on different cluster managers.
Some integration points that would be great to get feedback on:
1. Build procedure: Building SparkR requires R to be installed on the build
machine. Right now we have a new Maven profile `-PsparkR` that can be used to
enable SparkR builds.
2. YARN cluster mode: The R package that is built needs to be present on
the driver and all the worker nodes during execution. The R package location is
currently set using SPARK_HOME, but this might not work in YARN cluster mode.
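As a sketch of the build step described in point 1 (the `-PsparkR` profile comes from this PR; the rest of the Maven invocation is an assumption based on Spark's standard build conventions, not something this PR specifies):

```shell
# Build Spark with the SparkR profile enabled.
# Requires R to be installed on the build machine.
# -DskipTests is the usual flag to skip tests during packaging (assumed here).
mvn -DskipTests -PsparkR package
```

After a successful build, the R package is located relative to SPARK_HOME at runtime, which is the limitation point 2 raises for YARN cluster mode.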
The SparkR package represents the work of many contributors; listed below are
the people along with the areas they worked on:
edwardt (@edwart) - Documentation improvements
Felix Cheung (@felixcheung) - Documentation improvements
Hossein Falaki (@falaki) - Documentation improvements
Chris Freeman (@cafreeman) - DataFrame API, Programming Guide
Todd Gao (@7c00) - R worker Internals
Ryan Hafen (@hafen) - SparkR Internals
Qian Huang (@hqzizania) - RDD API
Hao Lin (@hlin09) - RDD API, Closure cleaner
Evert Lammerts (@evertlammerts) - DataFrame API
Davies Liu (@davies) - DataFrame API, R worker internals, Merging with
Spark
Yi Lu (@lythesia) - RDD API, Worker internals
Matt Massie (@massie) - Jenkins build
Harihar Nahak (@hnahak87) - SparkR examples
Oscar Olmedo (@oscaroboto) - Spark configuration
Antonio Piccolboni (@piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (@dputler) - DataFrame API, SparkR Install Guide
Ashutosh Raina (@ashutoshraina) - Build improvements
Josh Rosen (@joshrosen) - Travis CI build
Sun Rui (@sun-rui) - RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (@shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (@concretevitamin) - RDD API, Pipelined RDDs, Examples and
EC2 guide
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/amplab-extras/spark R
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5096.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5096
----
commit 9aa4acfeb2180b5b7c44302e1500d1bfe0639485
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-02-27T18:56:32Z
Merge pull request #184 from davies/socket
[SPARKR-155] use socket in R worker
commit 798f4536d9dfb069e0c8f1bbd1fb24be404a7c14
Author: cafreeman <[email protected]>
Date: 2015-02-27T20:04:22Z
Merge branch 'sparkr-sql' into dev
commit 3b4642980547714373ab1960cb9a096e2fcf233a
Author: Davies Liu <[email protected]>
Date: 2015-02-27T22:07:30Z
Merge branch 'master' of github.com:amplab-extras/SparkR-pkg into random
commit 5ef66fb8b03a635e309a5004a1b411b50f63ef9c
Author: Davies Liu <[email protected]>
Date: 2015-02-27T22:33:07Z
send back the port via temporary file
commit 2808dcfd2c0630625a5aa723cf0dbce642cd8f95
Author: cafreeman <[email protected]>
Date: 2015-02-27T23:54:17Z
Three more DataFrame methods
- `repartition`
- `distinct`
- `sampleDF`
commit cad0f0ca8c11ec5b3412b9926c92e89297a31b0a
Author: cafreeman <[email protected]>
Date: 2015-02-28T00:46:58Z
Fix docs and indents
commit 27dd3a09ce37d8afe385ccda35b425ac5655905c
Author: lythesia <[email protected]>
Date: 2015-02-28T02:00:41Z
modify tests for repartition
commit 889c265ee41f8faf3ee72e253cf019cb3a9a65a5
Author: cafreeman <[email protected]>
Date: 2015-02-28T02:08:18Z
numToInt utility function
Added `numToInt` converter function for allowing numeric arguments when
integers are required. Updated `repartition`.
commit 7b0d070bc0fd18e26d94dfd4dbcc500963faa5bb
Author: lythesia <[email protected]>
Date: 2015-02-28T02:10:35Z
keep partitions check
commit b0e7f731f4c64daac27a975a87b22c7276bbfe61
Author: cafreeman <[email protected]>
Date: 2015-02-28T02:28:08Z
Update `sampleDF` test
commit ad0935ef12fc6639a6ce45f1860d0f62c07ae838
Author: lythesia <[email protected]>
Date: 2015-02-28T02:50:34Z
minor fixes
commit 613464951add64f1f42a1bb814d86c0aa979cc18
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-02-28T03:05:45Z
Merge pull request #187 from cafreeman/sparkr-sql
Three more DataFrame methods
commit 0346e5fc907aab71aef122e6ddc1b96f93d9abbf
Author: Davies Liu <[email protected]>
Date: 2015-02-28T07:05:42Z
address comment
commit a00f5029279ca1e14afb4f1b63d91e946bddfd73
Author: lythesia <[email protected]>
Date: 2015-02-28T07:43:58Z
fix indents
commit e425437d54493d2c687310eb54eb195f01b08252
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-02-28T07:52:49Z
Merge pull request #177 from lythesia/master
[SPARKR-152] Support functions to change number of RDD partitions
(coalesce, repartition)
commit 5c72e73fb9e1971b66e359687807490a8fdc4d40
Author: Davies Liu <[email protected]>
Date: 2015-02-28T08:08:51Z
wait atmost 100 seconds
commit eb8ac119a0e266e656cbd3eeaf44c6722fd66045
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-02-28T08:35:20Z
Set Spark version 1.3.0 in Windows build
commit abb4bb9da2cfc65ccc9d58f3e48cdf8e3ad20a68
Author: Davies Liu <[email protected]>
Date: 2015-02-28T08:38:16Z
add Column and expression
commit ae05bf1c1374e454c98f8a4de716b8d8970f46f3
Author: Davies Liu <[email protected]>
Date: 2015-02-28T08:42:19Z
Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
Conflicts:
pkg/R/utils.R
pkg/inst/tests/test_sparkSQL.R
commit 7b7248759c228fe8b0d9418447f8e1fd7f71b723
Author: hlin09 <[email protected]>
Date: 2015-03-01T17:20:37Z
Fix comments.
commit 3f57e56e3f67603bd2fda165370930fd39ad5117
Author: hlin09 <[email protected]>
Date: 2015-03-01T20:43:01Z
Fix comments.
commit 4d36ab10389a6bccb0385a519ce0ce36dfc46696
Author: hlin09 <[email protected]>
Date: 2015-03-01T21:33:53Z
Add tests for broadcast variables.
commit 7afa4c9d31fc3a7e9676a75ac51e0983708ccb1a
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-03-01T22:44:59Z
Merge pull request #186 from hlin09/funcDep3
[SPARKR-142][SPARKR-196] (Step 2) Replaces getDependencies() with
cleanClosure to capture UDF closures and serialize them to worker.
commit 6e51c7ff25388bcf05776fa1ee353401b31b9443
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-03-01T23:00:24Z
Fix stderr redirection on executors
commit 8c4deaedc570c2753a2103d59aba20178d9ef777
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-03-01T23:06:29Z
Remove unused function
commit f7caeb84321f04291214f17a7a6606cb3a0ddee8
Author: Davies Liu <[email protected]>
Date: 2015-03-01T23:11:37Z
Update SparkRBackend.scala
commit b457833ea90575fb11840a18ff616f2d94be2aeb
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-03-01T23:15:05Z
Merge pull request #189 from shivaram/stdErrFix
Fix stderr redirection on executors
commit 862f07c337705337ca8719485e6fe301a711bac7
Author: Shivaram Venkataraman <[email protected]>
Date: 2015-03-01T23:20:35Z
Merge pull request #190 from shivaram/SPARKR-79
[SPARKR-79] Remove unused function
commit 773baf064c923d3f44ea8fdbb5d2f36194245040
Author: Zongheng Yang <[email protected]>
Date: 2015-03-02T00:35:23Z
Merge pull request #178 from davies/random
[SPARKR-204] use random port in backend
commit 5c0bb24bd77a6e1ed4474144f14b6458cdd2c157
Author: Felix Cheung <[email protected]>
Date: 2015-03-02T06:20:41Z
Doc updates: build and running on YARN
----
---