Re: [DISCUSS] Share Data in Zeppelin

Jeff Zhang Thu, 12 Jul 2018 19:26:35 -0700

I think your use case is the same as 2.b. Personally, I don't recommend
using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
1. noteId and paragraphId are meaningless identifiers, so the code is not readable.
2. The note will break if we clone it, as the noteId changes.
That's why I suggest using a paragraph property to save the paragraph's result.
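
The cloning problem can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: ResourcePool here is a plain dictionary wrapper rather than Zeppelin's real class, run_note and the "people" name are hypothetical, and each note is given its own pool to model a fresh session in which only that note has run.

```python
# Hypothetical in-memory stand-in for a resource pool; NOT Zeppelin's real API.
class ResourcePool:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is unknown.
        return self._data.get(key)


def run_note(pool, note_id):
    """Simulate running a note: paragraph 1 stores its result in two ways."""
    rows = [("alice", 30), ("bob", 25)]
    pool.put((note_id, "paragraph_1"), rows)  # keyed by note and paragraph IDs
    pool.put("people", rows)                  # keyed by a user-chosen name


# The reading paragraph was written against the original note's ID.
HARDCODED_NOTE_ID = "note_A"

# Original note: both lookups find the data.
original = ResourcePool()
run_note(original, "note_A")
by_id_original = original.get((HARDCODED_NOTE_ID, "paragraph_1"))
by_name_original = original.get("people")

# Cloned note: it gets a new ID, so the hard-coded ID no longer matches,
# while the name-based lookup keeps working.
clone = ResourcePool()
run_note(clone, "note_B")
by_id_clone = clone.get((HARDCODED_NOTE_ID, "paragraph_1"))
by_name_clone = clone.get("people")
```

In this sketch the ID-based lookup returns nothing in the clone, whereas the name-based lookup is unaffected, which is the motivation for naming the result through a paragraph property.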


Regarding the intermediate storage, I have also thought about it and agree
that in the long term we should provide such a layer to support large data.
Currently we keep the shared data in memory, which is not a scalable
solution. One candidate in my mind is Alluxio [1]. Regarding the data
format, I think Apache Arrow [2] is another good option for Zeppelin to
share table data across interpreter processes and different languages. But
these are all implementation details, and I think we can talk about them in
another thread. In this thread, I think we should focus on the user-facing
API.
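
One way to picture the separation between the user-facing API and the storage layer is the sketch below. It is only an illustration under stated assumptions: the class names, put_table and get_table are hypothetical, and a local CSV file stands in for a real storage layer and data format such as Alluxio and Apache Arrow.

```python
import csv
import os
import tempfile


class InMemoryStore:
    """Keeps every shared table in process memory (hypothetical class)."""

    def __init__(self):
        self._tables = {}

    def put_table(self, name, header, rows):
        self._tables[name] = (list(header), [list(r) for r in rows])

    def get_table(self, name):
        return self._tables[name]


class FileBackedStore:
    """Writes each table to disk. Local CSV is only a stand-in here for a
    real storage layer and format; it is not how Zeppelin stores data."""

    def __init__(self, directory):
        self._dir = directory

    def _path(self, name):
        return os.path.join(self._dir, name + ".csv")

    def put_table(self, name, header, rows):
        with open(self._path(name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)

    def get_table(self, name):
        with open(self._path(name), newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            return header, [row for row in reader]


def share_and_read(store):
    """The caller's code is identical whichever backend is plugged in."""
    # Values are strings because CSV does not preserve column types.
    store.put_table("people", ["name", "age"], [["alice", "30"], ["bob", "25"]])
    return store.get_table("people")


memory_result = share_and_read(InMemoryStore())
with tempfile.TemporaryDirectory() as tmp:
    file_result = share_and_read(FileBackedStore(tmp))
```

The point of the sketch is that share_and_read does not change when the backend does, so the user-facing API can be settled first and the storage choice revisited later.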


[1] http://www.alluxio.org/
[2] https://arrow.apache.org/



On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee <jongy...@gmail.com> wrote:

> I have a slightly different idea about sharing data.
>
> In my case,
>
> It would be very useful to get a paragraph's result as an input to other
> paragraphs.
>
> e.g.
>
> -- Paragraph 1
> %jdbc
> select * from some_table;
>
> -- Paragraph 2
> %spark
> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
> spark.read(table).select....
>
> If paragraph 1's result is too big to show on the FE, it would be saved in
> Zeppelin Server in a proper way and passed to SparkInterpreter when Paragraph
> 2 is executed.
>
> Basically, I think we need intermediate storage to store paragraphs'
> results in order to share them. We can introduce another layer or extend
> NotebookRepo. In some cases, we might change notebook repos as well.
>
> JL
>
>
>
> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> Recently, there have been several tickets [1][2][3] about sharing data in
>> Zeppelin. Zeppelin's goal is to be a unified data analytics platform which
>> can integrate most of the big data tools and help users switch between
>> tools and share data between them easily. So sharing data is a very
>> critical and killer feature of Zeppelin IMHO.
>>
>> I raise this thread to discuss the scenarios of sharing data and how
>> to do that. Although Zeppelin already provides tools and APIs to share
>> data, I don't think they are mature and stable enough. After seeing these
>> tickets, I think it might be a good time to talk about it in the community
>> and gather more feedback, so that we can provide a more stable and mature
>> approach for it.
>>
>> Currently, there are 3 approaches to share data between interpreters and
>> interpreter processes.
>> 1. Sharing data across interpreters in the same interpreter process, like
>> sharing data via the same SparkContext in %spark, %spark.pyspark and
>> %spark.r.
>> 2. Sharing data between frontend and backend via angularObject
>> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool
>>
>> For this thread, I would like to talk about approach 3 (sharing data
>> via Zeppelin's ResourcePool).
>>
>> Here's my current thinking on sharing data.
>> 1. What kind of data would be shared?
>>    IMHO, users would share 2 kinds of data: primitive data (string,
>> number) and table data.
>>
>> 2. How to write shared data?
>>     Users may want to share data via 2 approaches:
>>     a. Use ZeppelinContext (e.g. z.put).
>>     b. Share the paragraph result via paragraph properties. E.g. a user may
>> want to read data from an Oracle database via the jdbc interpreter and then
>> do plotting in the python interpreter. In such a scenario, he can save the
>> jdbc result in the ResourcePool via a paragraph property and then read it
>> via z.get. Here's one simple example (not implemented yet):
>>
>>         %jdbc(saveAsTable=people)
>>         select * from oracle_table
>>
>>         %python
>>         z.getTable("people").toPandas()
>>
>> 3. How to read shared data?
>>     Users also have 2 approaches to read the shared data:
>>     a. Via ZeppelinContext (e.g. z.get, z.getTable)
>>     b. Via variable substitution [1]
>>
>> Here's one sample note which illustrates the scenario of sharing data.
>>
>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>
>> This is just my current thinking on sharing data in Zeppelin. It
>> definitely doesn't cover all the scenarios, so I raise this thread to
>> discuss it in the community. Any feedback and comments are welcome.
>>
>>
>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>
>
>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>
