Re: [Architecture] Notebook Support Use cases for DAS

Srinath Perera Mon, 07 Dec 2015 19:50:06 -0800

Anjana, how is this thread progressing? Who is looking at/ thinking about
notebooks?


On Thu, Nov 26, 2015 at 9:19 AM, Anjana Fernando <[email protected]> wrote:

> Hi Srinath,
>
> On Thu, Nov 26, 2015 at 9:08 AM, Srinath Perera <[email protected]> wrote:
>
>> Hi Anjana,
>>
>> Great!! I think the next step is deciding whether we do this with
>> Zeppelin or build it from scratch.
>>
>> Pros of Zeppelin
>>
>>    1. We get a lot of features OOB
>>    2. Code maintained by community, patches etc.
>>    3. New features will get added and it will evolve
>>    4. We get to contribute to an Apache project and build recognition
>>
>> Cons
>>
>>    1. Really deep integration might be a lot of work (we get an initial
>>    version very fast, but integrating the details, e.g. making our UIs
>>    work in Zeppelin or getting Zeppelin to post to UES, might be tricky).
>>    2. Zeppelin is still in the incubator
>>    3. Need to assess community
>>
>> I suggest you guys have a detailed chat with MiyuruD, who looked at it in
>> detail, try things out, think about it, and report back.
>>
>
> +1, we'll work with Miyuru also and see how to go forward.
>
>
>>
>>
>> On Thu, Nov 26, 2015 at 3:12 AM, Anjana Fernando <[email protected]> wrote:
>>
>>> Hi Srinath,
>>>
>>> The story looks good. For the part about how the "user can play with the
>>> data interactively", to make it more functional, we should probably
>>> consider adding Scala scripts to the mix, rather than only having
>>> Spark SQL. Spark SQL may be limited in functionality for certain data
>>> operations, and with Scala, we should be able to use all the functionality
>>> of Spark. For example, it would be easier to integrate ML operations with
>>> other batch operations to create a more natural flow of operations.
>>> The implementation may be tricky though, considering clustering,
>>> multi-tenancy, etc.
>>>
>> Let's keep the Scala version for post-MVP.
>>
>
> Sure.
>
>
>>
>>
>>>
>>> Also, I would like to bring up a question: are most batch jobs
>>> actually meant to be scheduled repeatedly, for a data set that
>>> keeps growing, or do we mostly execute something once, get the
>>> results, and that's it? Maybe this is a different discussion though.
>>> But for scheduled batch jobs, I guess incremental processing would
>>> be critical, which no one seems to bother much about.
>>>
>> I think it is mostly scheduled batches, as we have now. Shall we take
>> this up in a different thread?
>>
>
> Yep, sure.
>
>
>>
>>
>>>
>>> Cheers,
>>> Anjana.
>>>
>>> On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I tried to write down the use cases, to start thinking about this
>>>> starting from what we discussed in the meeting. Please comment. (The doc is at
>>>> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
>>>> and the same content is below.)
>>>>
>>>> Thanks
>>>> Srinath
>>>>
>>>> Batch, Interactive, and Predictive Story
>>>>
>>>>    1. Data is uploaded to the system or sent as a data stream and
>>>>    collected for some time (in DAS).
>>>>    2. A Data Scientist comes in, selects a data set, looks at the
>>>>    schema of the data, and computes standard descriptive statistics such
>>>>    as mean, max, percentiles, and standard deviation.
>>>>    3. The Data Scientist cleans up the data using a series of
>>>>    transformations. This might include combining multiple data sets into
>>>>    one data set. [Notebooks]
>>>>    4. He can play with the data interactively.
>>>>    5. He visualizes the data in several ways. [Notebooks]
>>>>    6. If he needs descriptive statistics, he can export the data mutations
>>>>    in the notebooks as a script and schedule it.
>>>>    7. If what he needs is machine learning, he can initialize and run the
>>>>    ML Wizard from the Notebooks and create a model.
>>>>    8. He can export the model he created and any data mutation operations
>>>>    he did as a script, and deploy both the model and the data mutation
>>>>    operations in the CEP (Realtime Pipeline). This is the actual
>>>>    transaction flow.
>>>>    9. He can export the data mutation operations and machine learning
>>>>    model building logic as a script and schedule it to run periodically.
>>>>    This is the
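
As a rough illustration of the descriptive statistics named in step 2 (mean, max, percentiles, standard deviation), here is a minimal Python sketch using only the standard library. The function name `describe` and the sample values are assumptions made up for this example; they are not part of DAS or of the original document.

```python
import statistics


def describe(values):
    """Summarize one numeric column: mean, max, sample stdev, and quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "p25": q1,
        "p50": median,
        "p75": q3,
    }


# Made-up readings, used only to exercise the function.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
print(summary)
```

In a notebook, a summary like this would typically be computed per column of the selected data set before any cleanup transformations are applied.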
>>>>
>>>>
>>>>
>>>> [image: NotebookPipeline.png]
>>>>
>>>>
>>>>
>>>> Realtime Story
>>>>
>>>> For the realtime story too, we can start with a data set, write realtime
>>>> queries, test them by replaying the data, and only then deploy the queries.
>>>> (We do this even now.) We can do the same.
>>>>
>>>>
>>>>    1. The user starts with a dataset.
>>>>    2. He writes a set of queries using the dataset as a stream. Streams
>>>>    and datasets share the same record format. For example, consider the
>>>>    following data set.
>>>>
>>>>
>>>> We can consider this as a batch data set by taking it as a whole or as
>>>> a stream by taking record by record.
>>>>
>>>> For example, if we run the query
>>>>
>>>> select * from CountryData where GDP>35000
>>>>
>>>> it will provide the following results.
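
The idea that one data set can be read either as a batch or as a stream can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records in `COUNTRY_DATA` are made up (the data set from the original document is not reproduced in this thread), and `batch_query` and `stream_query` are hypothetical names, not DAS or CEP APIs.

```python
# Hypothetical CountryData records; the names and GDP values are made up
# for illustration and are not the data set from the original document.
COUNTRY_DATA = [
    {"country": "A", "gdp": 52000},
    {"country": "B", "gdp": 12000},
    {"country": "C", "gdp": 41000},
]


def batch_query(records, threshold=35000):
    """Batch view: evaluate the filter over the whole data set at once."""
    return [r for r in records if r["gdp"] > threshold]


def stream_query(records, threshold=35000):
    """Stream view: consume one record at a time, emitting matches as they arrive."""
    for record in records:
        if record["gdp"] > threshold:
            yield record


batch_result = batch_query(COUNTRY_DATA)
# Replaying the stored data set record by record simulates a stream.
stream_result = list(stream_query(iter(COUNTRY_DATA)))
```

Because both views share the same record format, replaying the stored data through the streaming filter yields the same rows as the batch filter, which is what makes testing realtime queries against a stored data set possible.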
>>>>
>>>>
>>>>
>>>>
>>>>    1. Tables created by replaying data with CEP queries can be visualized
>>>>    like other data (except that time is special).
>>>>    2. When the Data Scientist is happy, he can click a button and
>>>>    export the CEP queries as an execution plan and any charts as realtime
>>>>    gadgets. (One complication is that time is special, and we need to
>>>>    transform any visualization into a time-based visualization.)
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://people.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>>
>>>
>>> --
>>> *Anjana Fernando*
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>



-- 
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
