+1 for "to re-factor the Zeppelin architecture so that it can handle multi-tenancy easily"
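To make the contract DuyHai describes below concrete, here is a minimal sketch of what a per-user context could look like, passed to interpreters alongside InterpreterContext. To be clear, `UserContext` and its accessors are hypothetical names for illustration, not an existing Zeppelin API:

```java
// Hypothetical per-user contract, analogous to InterpreterContext.
// None of these names exist in Zeppelin today; they only illustrate the
// idea that each interpreter could opt in to multi-tenancy information.
class UserContext {
    private final String userName;
    private final String noteId;

    UserContext(String userName, String noteId) {
        this.userName = userName;
        this.noteId = noteId;
    }

    String getUserName() { return userName; }

    String getNoteId() { return noteId; }
}
```

An interpreter that ignores the context would behave exactly as today; one that uses it could key its resources (REPL instances, scheduler pools, connections) off the user or note.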
On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <[email protected]> wrote:

> Agree with Joel, we may think about re-factoring the Zeppelin architecture
> so that it can handle multi-tenancy easily. The technical solution proposed
> by Pranav is great, but it only applies to Spark. Right now, each
> interpreter has to manage multi-tenancy its own way. Ultimately Zeppelin
> could propose a multi-tenancy contract/info (like a UserContext, similar to
> InterpreterContext) that each interpreter can choose to use or not.
>
> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <[email protected]> wrote:
>
>> While the idea of running multiple notes simultaneously is great, it is
>> really dancing around the lack of true multi-user support in Zeppelin.
>> The proposed solution would work if the application's resources are those
>> of the whole cluster, but if the app is limited (say it has 8 of 16
>> cores, with some distribution in memory), then your note can potentially
>> hog all the resources, and the scheduler will have to throttle all other
>> executions, leaving you exactly where you are now.
>> While I think the solution is a good one, maybe this question makes us
>> think about adding true multi-user support, where we isolate resources
>> (cluster and the notebooks themselves), have separate login/identity,
>> and (I don't know if it's possible) share the same context.
>>
>> Thanks,
>> Joel
>>
>>> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <[email protected]> wrote:
>>>
>>> If the problem is that multiple users have to wait for each other while
>>> using Zeppelin, the solution already exists: they can create a new
>>> interpreter on the interpreter page and attach it to their notebook -
>>> then they don't have to wait for others to submit their jobs.
>>>
>>> But I agree, having paragraphs from one note wait for paragraphs from
>>> other notes is a confusing default. We can get around that in two ways:
>>>
>>> 1.
Create a new interpreter for each note and attach that interpreter
>>> to that note. This approach would require the least amount of code
>>> changes, but it is resource-heavy and doesn't let you share the
>>> SparkContext between different notes.
>>> 2. If we want to share the SparkContext between different notes, we can
>>> submit jobs from different notes into different fair-scheduler pools (
>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>>> This can be done by submitting jobs from different notes in different
>>> threads. This will make sure that jobs from one note run sequentially,
>>> but jobs from different notes are able to run in parallel.
>>>
>>> Neither of these options requires any change in the Spark code.
>>>
>>> --
>>> Thanks & Regards
>>> Rohit Agarwal
>>> https://www.linkedin.com/in/rohitagarwal003
>>>
>>> On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>> <[email protected]> wrote:
>>>
>>>>> If someone can share about the idea of sharing a single SparkContext
>>>>> through multiple SparkILoops safely, it'll be really helpful.
>>>>
>>>> Here is a proposal:
>>>> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
>>>> directory. While creating new instances of SparkIMain per notebook from
>>>> the Zeppelin Spark interpreter, set all the instances of SparkIMain to
>>>> the same virtual directory.
>>>> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>> server in the SparkContext using the classServerUri method.
>>>> 3. Scala-generated code has a notion of packages. The default package
>>>> name is "line$<linenumber>". The package name can be controlled using
>>>> the system property scala.repl.name.line. Setting this property to the
>>>> notebook id ensures that code generated by individual instances of
>>>> SparkIMain is isolated from the other instances of SparkIMain.
>>>> 4.
Build a queue inside the interpreter to allow only one paragraph
>>>> execution at a time per notebook.
>>>>
>>>> I have tested 1, 2, and 3, and it seems to provide isolation across
>>>> classnames. I'll work towards submitting a formal patch soon - is there
>>>> any JIRA already for the same that I can take up? Also, I need to
>>>> understand: how does Zeppelin take up Spark fixes? Or do I need to
>>>> first work towards getting the Spark changes merged into Apache Spark
>>>> on GitHub?
>>>>
>>>> Any suggestions or comments on the proposal are highly welcome.
>>>>
>>>> Regards,
>>>> -Pranav.
>>>>
>>>>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>
>>>>> Hi Piyush,
>>>>>
>>>>> A separate instance of SparkILoop/SparkIMain for each notebook while
>>>>> sharing the SparkContext sounds great.
>>>>>
>>>>> Actually, I tried to do it and found the problem that multiple
>>>>> SparkILoops could generate the same class names, and the Spark
>>>>> executors confuse the classnames, since they're reading classes from
>>>>> a single SparkContext.
>>>>>
>>>>> If someone can share about the idea of sharing a single SparkContext
>>>>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>
>>>>> Thanks,
>>>>> moon
>>>>>
>>>>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Moon,
>>>>>> Any suggestion on it? We have to wait a lot when multiple people are
>>>>>> working with Spark.
>>>>>> Can we create separate instances of SparkILoop, SparkIMain and print
>>>>>> streams for each notebook, while sharing the SparkContext,
>>>>>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>>>> parallel scheduler?
>>>>>> thanks
>>>>>>
>>>>>> -piyush
>>>>>>
>>>>>> Hi Moon,
>>>>>>
>>>>>> How about tracking a dedicated SparkContext for a notebook in Spark's
>>>>>> remote interpreter - this will allow multiple users to run their
>>>>>> Spark paragraphs in parallel. Also, within a notebook, only one
>>>>>> paragraph
Also, within a notebook only one paragraph >> is >> >>> executed at a time. >> >>> >> >>> Regards, >> >>> -Pranav. >> >>> >> >>> >> >>>> On 15/07/15 7:15 pm, moon soo Lee wrote: >> >>>> Hi, >> >>>> >> >>>> Thanks for asking question. >> >>>> >> >>>> The reason is simply because of it is running code statements. The >> >>>> statements can have order and dependency. Imagine i have two >> >>> paragraphs >> >>>> >> >>>> %spark >> >>>> val a = 1 >> >>>> >> >>>> %spark >> >>>> print(a) >> >>>> >> >>>> If they're not running one by one, that means they possibly runs in >> >>>> random order and the output will be always different. Either '1' or >> >>>> 'val a can not found'. >> >>>> >> >>>> This is the reason why. But if there are nice idea to handle this >> >>>> problem i agree using parallel scheduler would help a lot. >> >>>> >> >>>> Thanks, >> >>>> moon >> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng >> >>>> <[email protected] <mailto:[email protected]> >> >>> <mailto:[email protected] <mailto:[email protected]>>> >> >>> wrote: >> >>>> >> >>>> any one who have the same question with me? or this is not a >> >>> question? >> >>>> >> >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng <[email protected] >> >>> <mailto:[email protected]> >> >>>> <mailto:[email protected] <mailto: >> >>> [email protected]>>>: >> >>>> >> >>>> hi, Moon: >> >>>> I notice that the getScheduler function in the >> >>>> SparkInterpreter.java return a FIFOScheduler which makes the >> >>>> spark interpreter run spark job one by one. It's not a good >> >>>> experience when couple of users do some work on zeppelin at >> >>>> the same time, because they have to wait for each other. >> >>>> And at the same time, SparkSqlInterpreter can chose what >> >>>> scheduler to use by "zeppelin.spark.concurrentSQL". >> >>>> My question is, what kind of consideration do you based on >> >>> to >> >>>> make such a decision? 
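Step 3 of Pranav's proposal is worth calling out because it needs only a system property. A sketch, assuming a hypothetical note id value; `scala.repl.name.line` is the Scala REPL property named in the thread, while steps 1-2 (the shared virtual directory and classServerUri) require the SparkIMain changes described above and are not shown:

```java
// Sketch of step 3: distinct package prefixes per notebook keep the
// classes generated by each SparkIMain instance from colliding when the
// executors load them from the shared SparkContext. The note id and the
// "note" prefix used here are illustrative.
class ReplNameIsolation {
    static void isolate(String noteId) {
        // Per the proposal, the Scala REPL uses this property as the
        // prefix for generated package names, so per-note values give
        // per-note class namespaces. It must be set before the REPL
        // (SparkIMain) instance for the note is created.
        System.setProperty("scala.repl.name.line", "note" + noteId);
    }
}
```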
