+1 for "to re-factor the Zeppelin architecture so that it can handle multi-tenancy easily"
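To make the contract DuyHai describes below concrete, here is a minimal sketch of what a per-user context could look like, passed to interpreters alongside InterpreterContext. To be clear, `UserContext` and its accessors are hypothetical names for illustration, not an existing Zeppelin API:

```java
// Hypothetical per-user contract, analogous to InterpreterContext.
// None of these names exist in Zeppelin today; they only illustrate the
// idea that each interpreter could opt in to multi-tenancy information.
class UserContext {
    private final String userName;
    private final String noteId;

    UserContext(String userName, String noteId) {
        this.userName = userName;
        this.noteId = noteId;
    }

    String getUserName() { return userName; }

    String getNoteId() { return noteId; }
}
```

An interpreter that ignores the context would behave exactly as today; one that uses it could key its resources (REPL instances, scheduler pools, connections) off the user or note.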
On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <[email protected]> wrote:

> Agree with Joel, we may think about re-factoring the Zeppelin architecture
> so that it can handle multi-tenancy easily. The technical solution proposed
> by Pranav is great, but it only applies to Spark. Right now, each
> interpreter has to manage multi-tenancy its own way. Ultimately Zeppelin
> could propose a multi-tenancy contract/info (like a UserContext, similar to
> InterpreterContext) that each interpreter can choose to use or not.
>
> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <[email protected]> wrote:
>
>> While the idea of running multiple notes simultaneously is great, it is
>> really dancing around the lack of true multi-user support in Zeppelin.
>> The proposed solution would work if the application's resources are those
>> of the whole cluster, but if the app is limited (say it has 8 of 16
>> cores, with some distribution in memory), then your note can potentially
>> hog all the resources, and the scheduler will have to throttle all other
>> executions, leaving you exactly where you are now.
>> While I think the solution is a good one, maybe this question makes us
>> think about adding true multi-user support, where we isolate resources
>> (cluster and the notebooks themselves), have separate login/identity,
>> and (I don't know if it's possible) share the same context.
>>
>> Thanks,
>> Joel
>>
>>> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <[email protected]> wrote:
>>>
>>> If the problem is that multiple users have to wait for each other while
>>> using Zeppelin, the solution already exists: they can create a new
>>> interpreter on the interpreter page and attach it to their notebook -
>>> then they don't have to wait for others to submit their jobs.
>>>
>>> But I agree, having paragraphs from one note wait for paragraphs from
>>> other notes is a confusing default. We can get around that in two ways:
>>>
>>> 1.
Create a new interpreter for each note and attach that interpreter
>>> to that note. This approach would require the least amount of code
>>> changes, but it is resource-heavy and doesn't let you share the
>>> SparkContext between different notes.
>>> 2. If we want to share the SparkContext between different notes, we can
>>> submit jobs from different notes into different fair-scheduler pools (
>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>>> This can be done by submitting jobs from different notes in different
>>> threads. This will make sure that jobs from one note run sequentially,
>>> but jobs from different notes are able to run in parallel.
>>>
>>> Neither of these options requires any change in the Spark code.
>>>
>>> --
>>> Thanks & Regards
>>> Rohit Agarwal
>>> https://www.linkedin.com/in/rohitagarwal003
>>>
>>> On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>> <[email protected]> wrote:
>>>
>>>>> If someone can share about the idea of sharing a single SparkContext
>>>>> through multiple SparkILoops safely, it'll be really helpful.
>>>>
>>>> Here is a proposal:
>>>> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
>>>> directory. While creating new instances of SparkIMain per notebook from
>>>> the Zeppelin Spark interpreter, set all the instances of SparkIMain to
>>>> the same virtual directory.
>>>> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>> server in the SparkContext using the classServerUri method.
>>>> 3. Scala-generated code has a notion of packages. The default package
>>>> name is "line$<linenumber>". The package name can be controlled using
>>>> the system property scala.repl.name.line. Setting this property to the
>>>> notebook id ensures that code generated by individual instances of
>>>> SparkIMain is isolated from the other instances of SparkIMain.
>>>> 4.
Build a queue inside the interpreter to allow only one paragraph
>>>> execution at a time per notebook.
>>>>
>>>> I have tested 1, 2, and 3, and it seems to provide isolation across
>>>> classnames. I'll work towards submitting a formal patch soon - is there
>>>> any JIRA already for the same that I can take up? Also, I need to
>>>> understand: how does Zeppelin take up Spark fixes? Or do I need to
>>>> first work towards getting the Spark changes merged into Apache Spark
>>>> on GitHub?
>>>>
>>>> Any suggestions or comments on the proposal are highly welcome.
>>>>
>>>> Regards,
>>>> -Pranav.
>>>>
>>>>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>
>>>>> Hi Piyush,
>>>>>
>>>>> A separate instance of SparkILoop/SparkIMain for each notebook while
>>>>> sharing the SparkContext sounds great.
>>>>>
>>>>> Actually, I tried to do it and found the problem that multiple
>>>>> SparkILoops could generate the same class names, and the Spark
>>>>> executors confuse the classnames, since they're reading classes from
>>>>> a single SparkContext.
>>>>>
>>>>> If someone can share about the idea of sharing a single SparkContext
>>>>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>
>>>>> Thanks,
>>>>> moon
>>>>>
>>>>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Moon,
>>>>>> Any suggestion on it? We have to wait a lot when multiple people are
>>>>>> working with Spark.
>>>>>> Can we create separate instances of SparkILoop, SparkIMain and print
>>>>>> streams for each notebook, while sharing the SparkContext,
>>>>>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>>>> parallel scheduler?
>>>>>> thanks
>>>>>>
>>>>>> -piyush
>>>>>>
>>>>>> Hi Moon,
>>>>>>
>>>>>> How about tracking a dedicated SparkContext for a notebook in Spark's
>>>>>> remote interpreter - this will allow multiple users to run their
>>>>>> Spark paragraphs in parallel. Also, within a notebook, only one
>>>>>> paragraph
Also, within a notebook only one paragraph >> is >> >>> executed at a time. >> >>> >> >>> Regards, >> >>> -Pranav. >> >>> >> >>> >> >>>> On 15/07/15 7:15 pm, moon soo Lee wrote: >> >>>> Hi, >> >>>> >> >>>> Thanks for asking question. >> >>>> >> >>>> The reason is simply because of it is running code statements. The >> >>>> statements can have order and dependency. Imagine i have two >> >>> paragraphs >> >>>> >> >>>> %spark >> >>>> val a = 1 >> >>>> >> >>>> %spark >> >>>> print(a) >> >>>> >> >>>> If they're not running one by one, that means they possibly runs in >> >>>> random order and the output will be always different. Either '1' or >> >>>> 'val a can not found'. >> >>>> >> >>>> This is the reason why. But if there are nice idea to handle this >> >>>> problem i agree using parallel scheduler would help a lot. >> >>>> >> >>>> Thanks, >> >>>> moon >> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng >> >>>> <[email protected] <mailto:[email protected]> >> >>> <mailto:[email protected] <mailto:[email protected]>>> >> >>> wrote: >> >>>> >> >>>> any one who have the same question with me? or this is not a >> >>> question? >> >>>> >> >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng <[email protected] >> >>> <mailto:[email protected]> >> >>>> <mailto:[email protected] <mailto: >> >>> [email protected]>>>: >> >>>> >> >>>> hi, Moon: >> >>>> I notice that the getScheduler function in the >> >>>> SparkInterpreter.java return a FIFOScheduler which makes the >> >>>> spark interpreter run spark job one by one. It's not a good >> >>>> experience when couple of users do some work on zeppelin at >> >>>> the same time, because they have to wait for each other. >> >>>> And at the same time, SparkSqlInterpreter can chose what >> >>>> scheduler to use by "zeppelin.spark.concurrentSQL". >> >>>> My question is, what kind of consideration do you based on >> >>> to >> >>>> make such a decision? 
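Step 3 of Pranav's proposal is worth calling out because it needs only a system property. A sketch, assuming a hypothetical note id value; `scala.repl.name.line` is the Scala REPL property named in the thread, while steps 1-2 (the shared virtual directory and classServerUri) require the SparkIMain changes described above and are not shown:

```java
// Sketch of step 3: distinct package prefixes per notebook keep the
// classes generated by each SparkIMain instance from colliding when the
// executors load them from the shared SparkContext. The note id and the
// "note" prefix used here are illustrative.
class ReplNameIsolation {
    static void isolate(String noteId) {
        // Per the proposal, the Scala REPL uses this property as the
        // prefix for generated package names, so per-note values give
        // per-note class namespaces. It must be set before the REPL
        // (SparkIMain) instance for the note is created.
        System.setProperty("scala.repl.name.line", "note" + noteId);
    }
}
```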
