Agree with Joel; we may want to refactor the Zeppelin architecture so that it can handle multi-tenancy easily. The technical solution proposed by Pranav is great, but it only applies to Spark. Right now, each interpreter has to manage multi-tenancy in its own way. Ultimately, Zeppelin could expose a multi-tenancy contract (e.g. a UserContext, similar to InterpreterContext) that each interpreter can choose to use or ignore.
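As a purely hypothetical sketch of such a contract (none of these names exist in Zeppelin today; they only illustrate the idea, by analogy with the existing InterpreterContext):

```java
import java.util.Set;

public class UserContextSketch {

    // Hypothetical: immutable per-request tenancy info that would be passed
    // alongside InterpreterContext. All names here are illustrative.
    static final class UserContext {
        final String userName;   // authenticated user running the paragraph
        final String noteId;     // note the paragraph belongs to
        final Set<String> roles; // optional authorization info

        UserContext(String userName, String noteId, Set<String> roles) {
            this.userName = userName;
            this.noteId = noteId;
            this.roles = roles;
        }
    }

    // Interpreters that care about tenancy implement this; others ignore it,
    // which is the "choose to use or not" part of the proposal.
    interface MultiTenantAware {
        void openForUser(UserContext ctx);
    }

    public static void main(String[] args) {
        UserContext ctx = new UserContext("alice", "note-1", Set.of("admin"));
        System.out.println(ctx.userName + ":" + ctx.noteId); // prints alice:note-1
    }
}
```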
On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <[email protected]> wrote:

> While the idea of running multiple notes simultaneously is great, it is really dancing around the lack of true multi-user support in Zeppelin. The proposed solution would work if the application's resources are those of the whole cluster, but if the app is limited (say it has 8 cores of 16, with a corresponding share of memory), then one note can potentially hog all the resources, the scheduler will have to throttle all other executions, and you are left exactly where you are now.
> While I think the solution is a good one, maybe this question should make us think about adding true multi-user support, where we isolate resources (the cluster and the notebooks themselves), have separate login/identity, and (I don't know if it's possible) share the same context.
>
> Thanks,
> Joel
>
>
> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <[email protected]> wrote:
>
> > If the problem is that multiple users have to wait for each other while using Zeppelin, the solution already exists: they can create a new interpreter on the interpreter page and attach it to their notebook; then they don't have to wait for others to submit their jobs.
> >
> > But I agree, having paragraphs from one note wait for paragraphs from other notes is a confusing default. We can get around that in two ways:
> >
> > 1. Create a new interpreter for each note and attach that interpreter to that note. This approach requires the least amount of code change, but it is resource-heavy and doesn't let you share the SparkContext between different notes.
> > 2. If we want to share the SparkContext between different notes, we can submit jobs from different notes into different fair-scheduler pools (https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
> > This can be done by submitting jobs from different notes in different threads. It ensures that jobs from one note run sequentially, while jobs from different notes can run in parallel.
> >
> > Neither of these options requires any change to the Spark code.
> >
> > --
> > Thanks & Regards
> > Rohit Agarwal
> > https://www.linkedin.com/in/rohitagarwal003
> >
> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <[email protected]> wrote:
> >
> >> > If someone can share an idea for sharing a single SparkContext through multiple SparkILoops safely, it'll be really helpful.
> >>
> >> Here is a proposal:
> >> 1. In the Spark code, change SparkIMain.scala to allow setting the virtual directory. While creating new instances of SparkIMain per notebook from the Zeppelin Spark interpreter, point all the instances of SparkIMain at the same virtual directory.
> >> 2. Start an HTTP server on that virtual directory and register this HTTP server in the SparkContext using the classServerUri method.
> >> 3. Scala-generated code has a notion of packages. The default package name is "line$<linenumber>". The package name can be controlled using the system property scala.repl.name.line. Setting this property to the notebook id ensures that code generated by individual instances of SparkIMain is isolated from the other instances of SparkIMain.
> >> 4. Build a queue inside the interpreter to allow only one paragraph execution at a time per notebook.
> >>
> >> I have tested 1, 2, and 3, and this seems to provide isolation across class names. I'll work towards submitting a formal patch soon. Is there already a JIRA for this that I can pick up? Also, I need to understand: how does Zeppelin pick up Spark fixes? Or do I need to first get the Spark changes merged into Apache Spark on GitHub?
> >>
> >> Any suggestions or comments on the proposal are highly welcome.
> >>
> >> Regards,
> >> -Pranav.
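Rohit's option 2 and step 4 of Pranav's proposal share the same threading shape: one serial execution queue per note, with the queues themselves running in parallel. A minimal stdlib-only sketch of that shape (no Spark dependency; in the real interpreter, each note's worker thread would additionally call sc.setLocalProperty("spark.scheduler.pool", noteId) before submitting Spark jobs, which is what routes them to per-note fair-scheduler pools):

```java
import java.util.concurrent.*;

public class PerNoteScheduler {
    // One single-threaded executor per note: paragraphs of the same note run
    // sequentially, while paragraphs of different notes proceed in parallel.
    private final ConcurrentHashMap<String, ExecutorService> queues =
            new ConcurrentHashMap<>();

    public Future<?> submit(String noteId, Runnable paragraph) {
        ExecutorService q = queues.computeIfAbsent(noteId,
                id -> Executors.newSingleThreadExecutor());
        // In the Spark interpreter, the task itself would first pin its jobs to
        // a fair-scheduler pool named after the note (Spark-side, not shown here).
        return q.submit(paragraph);
    }

    public void shutdown() {
        queues.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) throws Exception {
        PerNoteScheduler s = new PerNoteScheduler();
        StringBuffer log = new StringBuffer(); // thread-safe appends
        Future<?> a1 = s.submit("noteA", () -> log.append("A1 "));
        Future<?> a2 = s.submit("noteA", () -> log.append("A2 ")); // waits for A1
        Future<?> b1 = s.submit("noteB", () -> log.append("B1 ")); // independent of note A
        a1.get(); a2.get(); b1.get();
        System.out.println(log); // A1 always precedes A2; B1 may interleave anywhere
        s.shutdown();
    }
}
```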
> >>
> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>>
> >>> Hi Piyush,
> >>>
> >>> A separate instance of SparkILoop/SparkIMain for each notebook while sharing the SparkContext sounds great.
> >>>
> >>> Actually, I tried to do it and found the problem that multiple SparkILoops can generate the same class name, and the Spark executors confuse the class names since they're reading classes from a single SparkContext.
> >>>
> >>> If someone can share an idea for sharing a single SparkContext through multiple SparkILoops safely, it'll be really helpful.
> >>>
> >>> Thanks,
> >>> moon
> >>>
> >>>
> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <[email protected]> wrote:
> >>>
> >>> Hi Moon,
> >>> Any suggestion on this? We have to wait a lot when multiple people are working with Spark.
> >>> Can we create a separate instance of SparkILoop, SparkIMain, and print streams for each notebook, while sharing the SparkContext, ZeppelinContext, SQLContext, and DependencyResolver, and then use the parallel scheduler?
> >>> thanks
> >>>
> >>> -piyush
> >>>
> >>> Hi Moon,
> >>>
> >>> How about tracking a dedicated SparkContext for a notebook in Spark's remote interpreter? This would allow multiple users to run their Spark paragraphs in parallel. Also, within a notebook, only one paragraph would be executed at a time.
> >>>
> >>> Regards,
> >>> -Pranav.
> >>>
> >>>
> >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>>> Hi,
> >>>>
> >>>> Thanks for asking the question.
> >>>>
> >>>> The reason is simply that it is running code statements. The statements can have order and dependencies. Imagine I have two paragraphs:
> >>>>
> >>>> %spark
> >>>> val a = 1
> >>>>
> >>>> %spark
> >>>> print(a)
> >>>>
> >>>> If they're not run one by one, they can run in random order and the output will not be deterministic: either '1' or a 'not found: value a' error.
> >>>>
> >>>> This is the reason why. But if there is a nice idea for handling this problem, I agree that using the parallel scheduler would help a lot.
> >>>>
> >>>> Thanks,
> >>>> moon
> >>>>
> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng <[email protected]> wrote:
> >>>>
> >>>> Anyone who has the same question as me? Or is this not a question?
> >>>>
> >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng <[email protected]>:
> >>>>
> >>>> Hi, Moon:
> >>>> I notice that the getScheduler function in SparkInterpreter.java returns a FIFOScheduler, which makes the Spark interpreter run Spark jobs one by one. It's not a good experience when a couple of users work on Zeppelin at the same time, because they have to wait for each other.
> >>>> At the same time, SparkSqlInterpreter can choose which scheduler to use via "zeppelin.spark.concurrentSQL".
> >>>> My question is: what kind of consideration is this decision based on?
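The distinction linxi asks about boils down to the thread pool behind the interpreter's job queue: a FIFO scheduler is effectively a single worker (jobs run strictly one by one), while the parallel scheduler used when zeppelin.spark.concurrentSQL is enabled allows jobs to overlap. A stdlib sketch of that difference (the property name is from the thread above; the pool sizes and helper are illustrative, not Zeppelin's actual implementation):

```java
import java.util.concurrent.*;

public class SchedulerChoice {
    // FIFO scheduler: one worker, jobs run strictly one-by-one.
    // Parallel scheduler: several workers, jobs may overlap.
    static ExecutorService forInterpreter(boolean concurrentSQL) {
        return concurrentSQL
                ? Executors.newFixedThreadPool(10)     // parallel, as with concurrentSQL=true
                : Executors.newSingleThreadExecutor(); // FIFO, the Spark interpreter default
    }

    public static void main(String[] args) throws Exception {
        boolean concurrent = Boolean.parseBoolean(
                System.getProperty("zeppelin.spark.concurrentSQL", "false"));
        ExecutorService pool = forInterpreter(concurrent);
        Future<Integer> job = pool.submit(() -> 1 + 1);
        System.out.println(job.get()); // prints 2
        pool.shutdown();
    }
}
```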
