Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-22 Thread Vin J
This is a Jupyter-based environment where we would like to put off binding
a Spark session/context to the notebook until it is needed. In a YARN cluster,
simply bootstrapping the Spark context/session requires a couple of
containers to be allocated, which is wasteful unless the user actually
performs the (optional) Spark processing.

I opened a JIRA, https://issues.apache.org/jira/browse/SPARK-20440, and
attached PR 17731 to it, as I think it better conveys both the problem and
the solution.

Regards,
Vin.

On Sat, Apr 22, 2017 at 1:39 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> This seems somewhat unique. Most notebook environments that I know of
> have a preset processing engine tied to the notebook; in other words, when
> Spark is selected as the engine it is always initialized, not lazily as you
> describe.
>
> What is this notebook platform you use?
>
> _
> From: Vin J <winjos...@gmail.com>
> Sent: Saturday, April 22, 2017 12:33 AM
> Subject: Re: [SparkR] - options around setting up SparkSession /
> SparkContext
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: <dev@spark.apache.org>
>
>
>
> This is for a notebook env that has the Spark session/context bootstrapped
> for the user. There are settings that are user-specific, so not all of them
> can go into spark-defaults.conf; such settings need to be dynamically
> applied when creating the session/context.
>
> In Scala/Python, I would bootstrap a "spark" handle similar to what the
> spark-shell / pyspark-shell startup scripts do. In my case the
> bootstrapped object could be of a wrapper class that takes care of whatever
> customization I need while exposing the regular SparkSession
> Scala/Python API. The user uses this object as he/she would use a regular
> SparkSession to submit work to the Spark cluster. Since I am certain there
> is no other way for users to perform Spark work except to go via the
> bootstrapped object, I can achieve my objective of delaying creation of
> the SparkSession/Context until a call comes to my custom spark object.
>
> If I want to do the same in R, and let users write SparkR code as they
> normally would while bootstrapping a SparkContext/Session for them, then I
> hit the issues I explained earlier. There is no single entry point for
> the SparkContext/Session in the SparkR API, so to achieve lazy creation of
> the SparkContext/Session it looks like the only option is to do some trickery
> with the SparkR:::.sparkREnv$.sparkRjsc and SparkR:::.sparkREnv$.sparkRsession
> vars.
>
> Regards,
> Vin.
>
> On Sat, Apr 22, 2017 at 3:33 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> How would you handle this in Scala?
>>
>> If you are adding a wrapper func like getSparkSession for Scala, and have
>> your users call it, can't you do the same in SparkR? After all, while it's
>> true that you don't need a SparkSession object to call the R API, someone
>> still needs to call sparkR.session() to initialize the current session?
>>
>> Also, what Spark environment do you want to customize?
>>
>> Can these be set in environment variables or via spark-defaults.conf?
>> spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
>>
>>
>> _
>> From: Vin J <winjos...@gmail.com>
>> Sent: Friday, April 21, 2017 2:22 PM
>> Subject: [SparkR] - options around setting up SparkSession / SparkContext
>> To: <dev@spark.apache.org>
>>
>>
>>
>>
>> I need to make an R environment available where the
>> SparkSession/SparkContext needs to be set up in a specific way. The user
>> simply accesses this environment and executes his/her code. If the user code
>> does not access any Spark functions, I do not want to create a SparkContext
>> unnecessarily.
>>
>> In Scala/Python environments, the user can't access Spark without first
>> referencing the SparkContext / SparkSession classes. So the above (lazy
>> and/or custom SparkSession/Context creation) is easily met by offering
>> SparkContext/SparkSession handles to the user that are either wrappers
>> around Spark's classes or have lazy evaluation semantics. This way, the
>> SparkSession/Context actually gets set up only when the user accesses these
>> handles, without the user needing to know all the details of initializing
>> the SparkContext/Session.
>>
>> However, achieving the same doesn't appear to be so straightforward in R.
>> From what I see, executing sparkR.session(...) sets up private variables in
>> SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API
>> works, a user doesn't need a handle to the Spark session as such. [...]

Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-22 Thread Felix Cheung
This seems somewhat unique. Most notebook environments that I know of have a 
preset processing engine tied to the notebook; in other words, when Spark is 
selected as the engine it is always initialized, not lazily as you describe.

What is this notebook platform you use?

_
From: Vin J <winjos...@gmail.com>
Sent: Saturday, April 22, 2017 12:33 AM
Subject: Re: [SparkR] - options around setting up SparkSession / SparkContext
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <dev@spark.apache.org>


This is for a notebook env that has the Spark session/context bootstrapped for 
the user. There are settings that are user-specific, so not all of them can go 
into spark-defaults.conf; such settings need to be dynamically applied when 
creating the session/context.

In Scala/Python, I would bootstrap a "spark" handle similar to what the 
spark-shell / pyspark-shell startup scripts do. In my case the bootstrapped 
object could be of a wrapper class that takes care of whatever customization I 
need while exposing the regular SparkSession Scala/Python API. The user uses 
this object as he/she would use a regular SparkSession to submit work to the 
Spark cluster. Since I am certain there is no other way for users to perform 
Spark work except to go via the bootstrapped object, I can achieve my objective 
of delaying creation of the SparkSession/Context until a call comes to my 
custom spark object.

If I want to do the same in R, and let users write SparkR code as they normally 
would while bootstrapping a SparkContext/Session for them, then I hit the issues 
I explained earlier. There is no single entry point for the SparkContext/Session 
in the SparkR API, so to achieve lazy creation of the SparkContext/Session it 
looks like the only option is to do some trickery with the 
SparkR:::.sparkREnv$.sparkRjsc and SparkR:::.sparkREnv$.sparkRsession vars.

Regards,
Vin.

On Sat, Apr 22, 2017 at 3:33 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala, and have your 
users call it, can't you do the same in SparkR? After all, while it's true that 
you don't need a SparkSession object to call the R API, someone still needs to 
call sparkR.session() to initialize the current session?

Also, what Spark environment do you want to customize?

Can these be set in environment variables or via spark-defaults.conf? 
spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties


_
From: Vin J <winjos...@gmail.com>
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext
To: <dev@spark.apache.org>




I need to make an R environment available where the SparkSession/SparkContext 
needs to be set up in a specific way. The user simply accesses this environment 
and executes his/her code. If the user code does not access any Spark functions, 
I do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access Spark without first 
referencing the SparkContext / SparkSession classes. So the above (lazy and/or 
custom SparkSession/Context creation) is easily met by offering 
SparkContext/SparkSession handles to the user that are either wrappers around 
Spark's classes or have lazy evaluation semantics. This way, the 
SparkSession/Context actually gets set up only when the user accesses these 
handles, without the user needing to know all the details of initializing the 
SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R. From 
what I see, executing sparkR.session(...) sets up private variables in 
SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API works, 
a user doesn't need a handle to the Spark session as such. Calls like 
"df <- as.DataFrame(..)" implicitly access the private vars in 
SparkR:::.sparkREnv to get at the SparkContext, etc., that are expected to have 
been created by a prior call to sparkR.session()/sparkR.init().

Therefore, to inject any custom/lazy behavior into this, I don't see a way 
except to have my code (which sits outside of Spark) apply a delayedAssign() or 
a makeActiveBinding() on the SparkR:::.sparkRsession / .sparkRjsc variables. 
This way, when Spark code internally references them, my wrapper/lazy code gets 
executed to do whatever I need done.

However, I am seeing some limitations of applying even this approach to SparkR; 
it will not work unless some minor changes are made in the SparkR code. [...]

Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-22 Thread Vin J
This is for a notebook env that has the Spark session/context bootstrapped
for the user. There are settings that are user-specific, so not all of them
can go into spark-defaults.conf; such settings need to be dynamically
applied when creating the session/context.
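
To make "dynamically applied" concrete, here is a minimal sketch of the kind of
per-user configuration applied at session-creation time (the settings and
per-user values below are hypothetical, purely for illustration):

    library(SparkR)

    # Hypothetical per-user values resolved by the notebook environment at
    # runtime; they differ per user, so they cannot live in spark-defaults.conf.
    userQueue <- "analytics"   # assumed per-user YARN queue
    userMem   <- "2g"          # assumed per-user executor memory

    # Apply the user-specific settings when the session is created.
    sparkR.session(
      master = "yarn",
      appName = paste0("notebook-", Sys.getenv("USER")),
      sparkConfig = list(
        spark.yarn.queue = userQueue,
        spark.executor.memory = userMem
      )
    )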

In Scala/Python, I would bootstrap a "spark" handle similar to what the
spark-shell / pyspark-shell startup scripts do. In my case the
bootstrapped object could be of a wrapper class that takes care of whatever
customization I need while exposing the regular SparkSession
Scala/Python API. The user uses this object as he/she would use a regular
SparkSession to submit work to the Spark cluster. Since I am certain there
is no other way for users to perform Spark work except to go via the
bootstrapped object, I can achieve my objective of delaying creation of
the SparkSession/Context until a call comes to my custom spark object.

If I want to do the same in R, and let users write SparkR code as they
normally would while bootstrapping a SparkContext/Session for them, then I
hit the issues I explained earlier. There is no single entry point for
the SparkContext/Session in the SparkR API, so to achieve lazy creation of
the SparkContext/Session it looks like the only option is to do some trickery
with the SparkR:::.sparkREnv$.sparkRjsc and
SparkR:::.sparkREnv$.sparkRsession vars.

Regards,
Vin.

On Sat, Apr 22, 2017 at 3:33 AM, Felix Cheung 
wrote:

> How would you handle this in Scala?
>
> If you are adding a wrapper func like getSparkSession for Scala, and have
> your users call it, can't you do the same in SparkR? After all, while it's
> true that you don't need a SparkSession object to call the R API, someone
> still needs to call sparkR.session() to initialize the current session?
>
> Also, what Spark environment do you want to customize?
>
> Can these be set in environment variables or via spark-defaults.conf?
> spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
>
>
> _
> From: Vin J 
> Sent: Friday, April 21, 2017 2:22 PM
> Subject: [SparkR] - options around setting up SparkSession / SparkContext
> To: 
>
>
>
>
> I need to make an R environment available where the
> SparkSession/SparkContext needs to be set up in a specific way. The user
> simply accesses this environment and executes his/her code. If the user code
> does not access any Spark functions, I do not want to create a SparkContext
> unnecessarily.
>
> In Scala/Python environments, the user can't access Spark without first
> referencing the SparkContext / SparkSession classes. So the above (lazy
> and/or custom SparkSession/Context creation) is easily met by offering
> SparkContext/SparkSession handles to the user that are either wrappers
> around Spark's classes or have lazy evaluation semantics. This way, the
> SparkSession/Context actually gets set up only when the user accesses these
> handles, without the user needing to know all the details of initializing
> the SparkContext/Session.
>
> However, achieving the same doesn't appear to be so straightforward in R.
> From what I see, executing sparkR.session(...) sets up private variables in
> SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API
> works, a user doesn't need a handle to the Spark session as such. Calls like
> "df <- as.DataFrame(..)" implicitly access the private vars in
> SparkR:::.sparkREnv to get at the SparkContext, etc., that are expected to
> have been created by a prior call to sparkR.session()/sparkR.init().
>
> Therefore, to inject any custom/lazy behavior into this, I don't see a way
> except to have my code (which sits outside of Spark) apply a delayedAssign()
> or a makeActiveBinding() on the SparkR:::.sparkRsession / .sparkRjsc
> variables. This way, when Spark code internally references them, my
> wrapper/lazy code gets executed to do whatever I need done.
>
> However, I am seeing some limitations of applying even this approach to
> SparkR; it will not work unless some minor changes are made in the SparkR
> code. But before opening a PR with these changes, I wanted to check whether
> there is a better way to achieve this. I am far from an R expert and could
> be missing something here.
>
> If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead
> and open one.
>
> Regards,
> Vin.
>


Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Felix Cheung
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala, and have your 
users call it, can't you do the same in SparkR? After all, while it's true that 
you don't need a SparkSession object to call the R API, someone still needs to 
call sparkR.session() to initialize the current session?

Also, what Spark environment do you want to customize?

Can these be set in environment variables or via spark-defaults.conf? 
spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties


_
From: Vin J <winjos...@gmail.com>
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext
To: <dev@spark.apache.org>



I need to make an R environment available where the SparkSession/SparkContext 
needs to be set up in a specific way. The user simply accesses this environment 
and executes his/her code. If the user code does not access any Spark functions, 
I do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access Spark without first 
referencing the SparkContext / SparkSession classes. So the above (lazy and/or 
custom SparkSession/Context creation) is easily met by offering 
SparkContext/SparkSession handles to the user that are either wrappers around 
Spark's classes or have lazy evaluation semantics. This way, the 
SparkSession/Context actually gets set up only when the user accesses these 
handles, without the user needing to know all the details of initializing the 
SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R. From 
what I see, executing sparkR.session(...) sets up private variables in 
SparkR:::.sparkREnv (.sparkRjsc, .sparkRsession). The way the SparkR API works, 
a user doesn't need a handle to the Spark session as such. Calls like 
"df <- as.DataFrame(..)" implicitly access the private vars in 
SparkR:::.sparkREnv to get at the SparkContext, etc., that are expected to have 
been created by a prior call to sparkR.session()/sparkR.init().
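
For reference, a minimal snippet (assuming a plain local Spark install) showing 
this implicit lookup; the user never holds a session handle, yet as.DataFrame() 
works because it reads the session from SparkR's internal environment:

    library(SparkR)

    # Creates the session and records it internally in SparkR:::.sparkREnv;
    # the user does not need to keep the returned value.
    sparkR.session(appName = "implicit-session-example")

    # No session argument is passed; as.DataFrame() resolves the current
    # session from SparkR's internal environment.
    df <- as.DataFrame(faithful)
    head(df)

    sparkR.session.stop()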

Therefore, to inject any custom/lazy behavior into this, I don't see a way 
except to have my code (which sits outside of Spark) apply a delayedAssign() or 
a makeActiveBinding() on the SparkR:::.sparkRsession / .sparkRjsc variables. 
This way, when Spark code internally references them, my wrapper/lazy code gets 
executed to do whatever I need done.
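
To make the idea concrete, a rough sketch of the makeActiveBinding() variant 
(the binding closure and its settings are illustrative assumptions, and, per the 
limitations described next, it does not work against SparkR as-is):

    library(SparkR)

    # A closure that lazily creates the session on the first read and accepts
    # the value SparkR assigns back once the real session exists.
    .lazySparkSession <- local({
      session <- NULL
      function(value) {
        if (!missing(value)) {
          session <<- value    # write path: SparkR internals store the session
        } else {
          if (is.null(session)) {
            # read path: first access triggers the customized session creation
            session <<- sparkR.session(appName = "lazy-notebook-session")
          }
          session
        }
      }
    })

    # Install the active binding so that internal reads of .sparkRsession run
    # the closure above; a similar binding would be needed for .sparkRjsc.
    makeActiveBinding(".sparkRsession", .lazySparkSession, SparkR:::.sparkREnv)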

However, I am seeing some limitations of applying even this approach to SparkR; 
it will not work unless some minor changes are made in the SparkR code. But 
before opening a PR with these changes, I wanted to check whether there is a 
better way to achieve this. I am far from an R expert and could be missing 
something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead and 
open one.

Regards,
Vin.