Hi Srinath,

On Thu, Nov 26, 2015 at 9:08 AM, Srinath Perera <[email protected]> wrote:

> Hi Anjana,
>
> Great!! I think the next step is deciding whether we do this with Zeppelin
> or build it from scratch.
>
> Pros of Zeppelin
>
>    1. We get a lot of features OOB
>    2. Code maintained by community, patches etc.
>    3. New features will get added and it will evolve
>    4. We get to contribute to an Apache project and build recognition
>
> Cons
>
>    1. Really deep integration might be a lot of work (we get an initial version
>    very fast, but integrating the details, e.g. making our UIs work in Zeppelin
>    or getting Zeppelin to post to UES, might be tricky).
>    2. Zeppelin is still in the incubator
>    3. Need to assess community
>
> I suggest you guys have a detailed chat with MiyuruD, who looked at it in
> detail, try things out, think about it, and report back.
>

+1, we'll work with Miyuru also and see how to go forward.


>
>
> On Thu, Nov 26, 2015 at 3:12 AM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Srinath,
>>
>> The story looks good. For the part about "the user can play with the
>> data interactively", to make it more functional, we should probably
>> consider adding Scala scripts to the mix, rather than only having
>> Spark SQL. Spark SQL may be limited in functionality for certain data
>> operations, and with Scala, we should be able to use all the functionality
>> of Spark. For example, it would be easier to integrate ML operations with
>> other batch operations, etc., to create a more natural flow of operations.
>> The implementation may be tricky though, considering clustering,
>> multi-tenancy, etc.
>>
> Let's keep the Scala version post-MVP.
>

Sure.


>
>
>>
>> Also, I would like to bring up a question: are most batch jobs
>> actually meant to be scheduled repeatedly, for a data set that
>> keeps growing? Or is it mostly a case where we execute
>> something once, get the results, and that's it? Maybe this is a different
>> discussion though. But for scheduled batch jobs, I guess
>> incremental processing would be critical, which no one seems to bother
>> much about.
>>
> I think it is mostly scheduled batches as we have. Shall we take this up
> in a different thread?
>

Yep, sure.


>
>
>>
>> Cheers,
>> Anjana.
>>
>> On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I tried to write down the use cases, to start thinking about this
>>> starting from what we discussed in the meeting. Please comment. The doc is at
>>> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
>>> (the same content is below).
>>>
>>> Thanks
>>> Srinath
>>>
>>> Batch, Interactive, and Predictive Story
>>>
>>>    1. Data is uploaded to the system or sent as a data stream and
>>>    collected for some time (in DAS).
>>>    2. The Data Scientist comes in, selects a data set, looks at the schema of
>>>    the data, and computes standard descriptive statistics such as mean, max,
>>>    percentiles, and standard deviation of the data.
>>>    3. The Data Scientist cleans up the data using a series of transformations.
>>>    This might include combining multiple data sets into one data set.
>>>    [Notebooks]
>>>    4. He can play with the data interactively.
>>>    5. He visualizes the data in several ways. [Notebooks]
>>>    6. If he needs descriptive statistics, he can export the data mutations
>>>    in the notebooks as a script and schedule it.
>>>    7. If what he needs is machine learning, he can initialize and run the
>>>    ML Wizard from the Notebooks and create a model.
>>>    8. He can export the model he created and any data mutation operations
>>>    he did as a script and deploy both the model and data mutation operations
>>>    in the CEP (Realtime Pipeline). This is the actual transaction flow.
>>>    9. He can export the data mutation operations and machine learning
>>>    model building logic as a script and schedule it to run periodically.
>>>    This is the
>>>
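
A minimal sketch of the descriptive statistics in step 2 above, using only the Python standard library. The function name `describe` and the sample values are hypothetical and only for illustration; this is not how DAS or the notebooks would compute them.

```python
import statistics
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, float]:
    """Return mean, max, two percentiles, and sample standard deviation."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = statistics.quantiles(values, n=100)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "p50": cut_points[49],
        "p95": cut_points[94],
        "stdev": statistics.stdev(values),
    }


# Hypothetical column of numeric values from a selected data set.
sample = [10, 20, 30, 40, 50]
summary = describe(sample)
```

The same summary could later be exported as a scheduled script, as step 6 suggests.
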
>>>
>>>
>>> [image: NotebookPipeline.png]
>>>
>>>
>>>
>>> Realtime Story
>>>
>>> For the realtime story too, we can start with a data set, write realtime
>>> queries, test them by replaying the data, and only then deploy the queries.
>>> (We do this even now.) We can do the same.
>>>
>>>
>>>    1. The user starts with a dataset.
>>>    2. He writes a set of queries using the dataset as a stream. Streams and
>>>    datasets share the same record format. For example, consider the following
>>>    data set.
>>>
>>>
>>> We can consider this as a batch data set by taking it as a whole, or as a
>>> stream by taking it record by record.
>>>
>>> For example, if we run the query
>>>
>>> select * from CountryData where GDP>35000
>>>
>>> it will provide the following results.
>>>
>>>
>>>
>>>
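
A rough sketch of the batch/stream duality described above, in plain Python rather than Spark SQL or CEP. The records in `COUNTRY_DATA` are invented placeholders (the original data set is not reproduced here), and the function names are hypothetical. The point is only that one record format and one predicate can serve both views.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]

# Hypothetical sample records; names and GDP values are made up for illustration.
COUNTRY_DATA: List[Record] = [
    {"country": "A", "GDP": 52000},
    {"country": "B", "GDP": 12000},
    {"country": "C", "GDP": 41000},
]


def high_gdp(record: Record, threshold: int = 35000) -> bool:
    """Predicate corresponding to: where GDP > 35000."""
    return record["GDP"] > threshold


def query_batch(dataset: List[Record]) -> List[Record]:
    """Batch view: evaluate the filter over the whole data set at once."""
    return [r for r in dataset if high_gdp(r)]


def query_stream(events: Iterable[Record]) -> Iterator[Record]:
    """Stream view: evaluate the same filter one record at a time."""
    for event in events:
        if high_gdp(event):
            yield event


batch_result = query_batch(COUNTRY_DATA)
# Replaying the stored data set as a stream exercises the streaming query.
stream_result = list(query_stream(iter(COUNTRY_DATA)))
```

Replaying the data set through the streaming path and comparing it with the batch result is one way to test realtime queries before deploying them.
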
>>>    3. Tables created by replaying data through CEP queries can be visualized
>>>    like other data (except that time is special).
>>>    4. When the Data Scientist is happy, he can click a button and
>>>    export the CEP queries as an execution plan and any charts as realtime
>>>    gadgets. (One complication is that time is special, and we need to transform
>>>    any visualization into a time-based visualization.)
>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
>



-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
