Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Tobias Pfeiffer
Hi,

There is also Spindle, which was introduced on this list some time ago. I
haven't looked into it deeply, but you might gain some valuable insights
from their architecture: they also use Spark to fulfill requests coming
from the web.

Tobias


RE: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Mohammed Guller
David,

Here is what I would suggest:

1 - Does a new SparkContext get created in the web tier for each new request
for processing?
Create a single SparkContext that is shared across multiple web requests.
Depending on the framework you are using for the web tier, it should not be
difficult to create a global singleton object that holds the SparkContext.
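
For example, a minimal sketch of that singleton in Scala (the app name and
master URL are placeholders; set them for your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext for the entire web tier. A Scala object is initialized
// lazily and exactly once, so every request handler that touches
// SparkContextHolder.sc shares the same context.
object SparkContextHolder {
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setAppName("web-controller")        // placeholder app name
      .setMaster("spark://master:7077")    // placeholder master URL
    new SparkContext(conf)
  }
}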

2 - If so, how much time should we expect it to take for setting up the
context? Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible. From
using spark-shell one might conclude that it takes more than 10 seconds to
create a context, but it's not clear how much of that is context creation
vs. other things.
With the single shared SparkContext from #1, context creation becomes a
one-time startup cost rather than a per-request cost, so it should not count
against your 10-second response goal.

3 - (This last question perhaps deserves a post in and of itself.) If every
job is always comparing some small data structure to the same HDFS corpus,
what is the best pattern to use to cache the RDDs from HDFS so they don't
have to be re-constituted from disk every time? That is, how can RDDs be
"shared" from the context of one job to the context of subsequent jobs? Or
does something like memcached have to be used?
Create a cached RDD in a global singleton object, which gets accessed by
multiple web requests. You could put the cached RDD in the same object that
holds the SparkContext if you like. I would need to know more about your
application to be more specific, but hopefully you get the idea.
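
A sketch building on the singleton above (the HDFS path and the refresh
hook are assumptions; since your corpus changes incrementally each day, you
would call refresh() from a scheduled task after the daily update):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Holds the corpus RDD next to the shared SparkContext. Every web request
// runs its comparison against the same cached RDD instead of re-reading
// the files from HDFS.
object CorpusCache {
  import SparkContextHolder.sc

  @volatile private var corpusRdd: RDD[String] = load()

  private def load(): RDD[String] = {
    val rdd = sc.textFile("hdfs:///data/corpus")   // placeholder path
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()   // force materialization now, not on the first request
    rdd
  }

  def corpus: RDD[String] = corpusRdd

  // Call this from a daily scheduled task after the corpus is updated.
  def refresh(): Unit = synchronized {
    val old = corpusRdd
    corpusRdd = load()
    old.unpersist()
  }
}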


Mohammed


Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs, you could also have a look at
Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks: http://tachyon-project.org/
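
Because that layer is HDFS-compatible, using it from Spark is mostly a
matter of the URI scheme. A rough sketch in Scala (host, port, and paths
are placeholders; it assumes an existing SparkContext `sc` and the Tachyon
client on Spark's classpath):

// Write the corpus into Tachyon once, from any job...
val corpus = sc.textFile("hdfs:///data/corpus")   // placeholder path
corpus.saveAsTextFile("tachyon://tachyon-master:19998/corpus")

// ...later jobs, even ones running in a different SparkContext, can then
// read it back through the same API without re-reading from disk.
val shared = sc.textFile("tachyon://tachyon-master:19998/corpus")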


Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across
multiple jobs; take a look at
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
seems to fit closely to what you need.
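
As an illustration, here is a hedged sketch of a job-server job in Scala,
written against the API described in the project's docs (the class name,
RDD name, and HDFS path are placeholders): the server owns a long-lived
SparkContext, and the NamedRddSupport mixin lets jobs share RDDs by name
across requests.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CompareJob extends SparkJob with NamedRddSupport {

  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Reuse the corpus if an earlier job already cached it under this
    // name; otherwise build and cache it now.
    val corpus = namedRdds.getOrElseCreate("corpus",
      sc.textFile("hdfs:///data/corpus"))   // placeholder path
    corpus.count()   // stand-in for the real comparison logic
  }
}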

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>

On Tue, Nov 11, 2014 at 7:20 PM, bethesda wrote:

> We are relatively new to Spark and so far, during our development process,
> have been manually submitting single jobs at a time for ML training, using
> spark-submit. Each job accepts a small user-submitted data set and
> compares it to every data set in our HDFS corpus, which only changes
> incrementally on a daily basis. (That detail is relevant to question 3
> below.)
>
> Now we are ready to start building out the front-end, which will allow a
> team of data scientists to submit their problems to the system via a web
> front-end (the web tier will be Java). Users could of course be submitting
> jobs more or less simultaneously. We want to make sure we understand how
> best to structure this.
>
> Questions:
>
> 1 - Does a new SparkContext get created in the web tier for each new
> request for processing?
>
> 2 - If so, how much time should we expect it to take for setting up the
> context? Our goal is to return a response to the users in under 10
> seconds, but if it takes many seconds to create a new context or otherwise
> set up the job, then we need to adjust our expectations for what is
> possible. From using spark-shell one might conclude that it takes more
> than 10 seconds to create a context, but it's not clear how much of that
> is context creation vs. other things.
>
> 3 - (This last question perhaps deserves a post in and of itself.) If
> every job is always comparing some small data structure to the same HDFS
> corpus, what is the best pattern to use to cache the RDDs from HDFS so
> they don't have to be re-constituted from disk every time? That is, how
> can RDDs be "shared" from the context of one job to the context of
> subsequent jobs? Or does something like memcached have to be used?
>
> Thanks!
> David