Re: [SparkR] - options around setting up SparkSession / SparkContext

Felix Cheung Fri, 21 Apr 2017 15:04:14 -0700

How would you handle this in Scala?

If you are adding a wrapper function like getSparkSession for Scala and having
your users call it, can't you do the same in SparkR? After all, while it's true
that you don't need a SparkSession object to call the R API, someone still
needs to call sparkR.session() to initialize the current session?


Also, what Spark environment do you want to customize?

Can these be set in environment variables or via spark-defaults.conf?
http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties


_____________________________
From: Vin J <[email protected]>
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext
To: <[email protected]>



I need to make an R environment available where the SparkSession/SparkContext
needs to be set up in a specific way. The user simply accesses this environment
and executes their code. If the user code does not call any Spark functions, I
do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access Spark without first
referencing the SparkContext / SparkSession classes. So the above requirement
(lazy and/or custom SparkSession/Context creation) is easily met by offering
sparkContext/sparkSession handles to the user that are either wrappers on
Spark's classes or have lazy evaluation semantics. This way, the
SparkSession/Context only gets set up when the user actually accesses these
handles, and the user does not need to know the details of initializing the
SparkContext/Session.
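
For illustration, the lazy-handle idea for the Python case could be sketched
roughly as below. This is only a sketch under assumptions: LazyHandle,
FakeSession and make_session are hypothetical names invented for this example,
and FakeSession stands in for a real session so the snippet runs without Spark
installed. In a real environment the factory would presumably build the
session with whatever custom settings are required.

```python
import threading
from typing import Any, Callable


class LazyHandle:
    """Hypothetical wrapper: builds the wrapped object on first attribute access."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._instance = None
        self._lock = threading.Lock()

    def _get(self) -> Any:
        # Create the wrapped object once, even with concurrent callers.
        with self._lock:
            if self._instance is None:
                self._instance = self._factory()
            return self._instance

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup fails, i.e. for attributes that
        # belong to the wrapped object. Private names are never forwarded,
        # which avoids recursion if the handle is only partly initialized.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._get(), name)


class FakeSession:
    """Stand-in for a real session object (assumption, for illustration only)."""

    def __init__(self, app_name: str) -> None:
        self.app_name = app_name

    def sql(self, query: str) -> str:
        return f"[{self.app_name}] {query}"


created = []  # records each time the factory actually runs


def make_session() -> FakeSession:
    # Assumption: a real factory would do the custom session setup here.
    created.append("session")
    return FakeSession("custom-app")


# The handle given to users; nothing has been created at this point.
spark = LazyHandle(make_session)
```

Nothing is constructed until user code touches the handle, for example by
calling spark.sql(...); later accesses reuse the same instance.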

However, achieving the same doesn't appear to be so straightforward in R. From
what I see, executing sparkR.session(...) sets up private variables in
SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API works,
a user doesn't need a handle to the Spark session as such. Executing a function
such as "df <- as.DataFrame(..)" implicitly accesses the private variables in
SparkR:::.sparkREnv to get at the SparkContext etc., which are expected to have
been created by a prior call to sparkR.session()/sparkR.init() etc.

Therefore, to inject any custom/lazy behavior into this, I don't see a way
other than having my code (which sits outside of Spark) apply a
delayedAssign() or a makeActiveBinding() to the SparkR:::.sparkRsession /
.sparkRjsc variables. That way, when Spark code references them internally, my
wrapper/lazy code gets executed to do whatever I need done.

However, I am seeing some limitations in applying even this approach to SparkR:
it will not work unless some minor changes are made to the SparkR code. But
before I open a PR with these changes to SparkR, I wanted to check whether
there is a better way to achieve this. I am far from an R expert and could be
missing something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead and 
open one.

Regards,
Vin.
