Hi Anjana,

Great! I think the next step is deciding whether we do this with Zeppelin
or build it from scratch.

Pros of Zeppelin

   1. We get a lot of features out of the box.
   2. The code is maintained by the community (patches etc.).
   3. New features will be added and it will evolve.
   4. We get to contribute to an Apache project and build recognition.

Cons

   1. Really deep integration might be a lot of work (we get an initial
   version very fast, but integrating the details, e.g. making our UIs work
   in Zeppelin or getting Zeppelin to post to UES, might be tricky).
   2. Zeppelin is still in the Apache Incubator.
   3. We need to assess the community.

I suggest you guys have a detailed chat with MiyuruD, who looked at it in
detail, try things out, think about it, and report back.


On Thu, Nov 26, 2015 at 3:12 AM, Anjana Fernando <[email protected]> wrote:

> Hi Srinath,
>
> The story looks good. For the part about the "user can play with the
> data interactively", to make it more functional, we should probably
> consider adding Scala scripts to the mix, rather than only having
> Spark SQL. Spark SQL may be limited in functionality for certain data
> operations, and with Scala, we should be able to use all the functionality
> of Spark. For example, it would be easier to integrate ML operations with
> other batch operations etc. to create a more natural flow of operations.
> The implementation may be tricky though, considering clustering,
> multi-tenancy etc.
>
Let's keep the Scala version for post-MVP.


>
> Also, I would like to bring up the question of whether most batch jobs
> are actually meant to be scheduled repeatedly, over a data set that
> keeps growing, or whether it is mostly a case of executing something
> once, getting the results, and that's it. Maybe this is a different
> discussion though. But for scheduled batch jobs as such, I guess
> incremental processing would be critical, which no one seems to bother
> with much.
>
I think it is mostly scheduled batches, as we have now. Shall we take this
up in a different thread?
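To make the incremental-processing idea concrete, here is a minimal sketch in plain Python (the function name, the `(timestamp, payload)` record shape, and the checkpoint mechanism are all hypothetical illustrations, not an existing DAS API): each scheduled run processes only records that arrived after the previous run's high-water mark, so a growing data set is not reprocessed from scratch.

```python
def run_incremental(records, checkpoint, process):
    """Hypothetical incremental batch job.

    records: list of (timestamp, payload) tuples, the full (growing) data set.
    checkpoint: the high-water mark (timestamp) from the previous run.
    process: callback applied to each newly arrived payload.
    Returns the new checkpoint for the next scheduled run.
    """
    # Only records strictly newer than the checkpoint are processed.
    new = [(ts, p) for ts, p in records if ts > checkpoint]
    for ts, payload in new:
        process(payload)
    # Advance the high-water mark so the next run skips these records.
    return max((ts for ts, _ in new), default=checkpoint)

seen = []
cp = run_incremental([(1, "a"), (2, "b")], 0, seen.append)      # first run: all records
cp = run_incremental([(1, "a"), (2, "b"), (3, "c")], cp, seen.append)  # next run: only "c"
```

The first run (checkpoint 0) processes everything; each later run touches only the records appended since, which is what makes repeated scheduling over an ever-growing data set cheap.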


>
> Cheers,
> Anjana.
>
> On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]> wrote:
>
>> Hi All,
>>
>> I tried to write down the use cases, starting from what we discussed in
>> the meeting. Please comment. (The doc is at
>> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
>> and the same content is below.)
>>
>> Thanks
>> Srinath
>> Batch, interactive, and Predictive Story
>>
>>    1. Data is uploaded to the system or sent as a data stream and
>>    collected for some time (in DAS).
>>    2. A data scientist comes in, selects a data set, looks at the schema,
>>    and computes standard descriptive statistics about the data, such as
>>    mean, max, percentiles, and standard deviation.
>>    3. The data scientist cleans up the data using a series of
>>    transformations. This might include combining multiple data sets into
>>    one. [Notebooks]
>>    4. He can play with the data interactively.
>>    5. He visualizes the data in several ways. [Notebooks]
>>    6. If he needs descriptive statistics, he can export the data
>>    mutations in the notebook as a script and schedule it.
>>    7. If what he needs is machine learning, he can launch the ML Wizard
>>    from the notebook and create a model.
>>    8. He can export the model he created and any data mutation operations
>>    he did as a script, and deploy both the model and the data mutation
>>    operations in the CEP (realtime pipeline). This is the actual
>>    transaction flow.
>>    9. He can export the data mutation operations and the machine-learning
>>    model-building logic as a script and schedule it to run periodically.
>>    This is the
>>
>>
>>
>> [image: NotebookPipeline.png]
>>
>>
>>
>> Realtime Story
>>
>> For the realtime story too, we can start with a data set, write realtime
>> queries, test them by replaying the data, and only then deploy the
>> queries, as we do even now.
>>
>>
>>    1. The user starts with a dataset.
>>    2. He writes a set of queries using the dataset as a stream. Streams
>>    and datasets share the same record format. For example, consider the
>>    following data set.
>>
>>
>> We can consider this as a batch data set by taking it as a whole, or as a
>> stream by taking it record by record.
>>
>> For example, if we run the query
>>
>> select * from CountryData where GDP > 35000
>>
>> it will produce the following results.
>>
>>
>>
>>
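The batch-vs-stream duality described above can be sketched in plain Python (the `CountryData` rows and GDP figures below are hypothetical stand-ins, since the original data set and results were attached as images; the function names are illustrative, not a DAS/CEP API):

```python
# Hypothetical CountryData records; the original email showed these as an image.
country_data = [
    {"country": "USA", "gdp": 55000},
    {"country": "Sri Lanka", "gdp": 12000},
    {"country": "Norway", "gdp": 75000},
]

def run_batch(dataset, predicate):
    """Batch view: apply the query to the data set taken as a whole."""
    return [row for row in dataset if predicate(row)]

def run_stream(dataset, predicate):
    """Stream view: replay the same records one by one, emitting matches as they arrive."""
    for row in dataset:
        if predicate(row):
            yield row

# Stand-in for: select * from CountryData where GDP > 35000
query = lambda row: row["gdp"] > 35000

batch_result = run_batch(country_data, query)
stream_result = list(run_stream(country_data, query))
assert batch_result == stream_result  # same records, same query, same answer either way
```

Because streams and datasets share the same record format, the same predicate can drive either interpretation; replaying the dataset record by record is exactly how the stored data can be used to test realtime queries before deploying them.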
>>    1. Tables created by replaying data through CEP queries can be
>>    visualized like any other data (except that time is special).
>>    2. When the data scientist is happy, he can click a button and export
>>    the CEP queries as an execution plan, and any charts as realtime
>>    gadgets. (One complication is that time is special, and we need to
>>    transform any visualization into a time-based visualization.)
>>
>>
>> --
>> ============================
>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>> Site: http://people.apache.org/~hemapani/
>> Photos: http://www.flickr.com/photos/hemapani/
>> Phone: 0772360902
>>
>
>
>
> --
> *Anjana Fernando*
> Senior Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware
>



-- 
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
