Anjana, how is this thread progressing? Who is looking at or thinking about notebooks?

On Thu, Nov 26, 2015 at 9:19 AM, Anjana Fernando <[email protected]> wrote:

> Hi Srinath,
>
> On Thu, Nov 26, 2015 at 9:08 AM, Srinath Perera <[email protected]> wrote:
>
>> Hi Anjana,
>>
>> Great!! I think the next step is deciding whether we do this with
>> Zeppelin or build it from scratch.
>>
>> Pros of Zeppelin
>>
>>    1. We get a lot of features OOB
>>    2. Code is maintained by the community, patches etc.
>>    3. New features will get added and it will evolve
>>    4. We get to contribute to an Apache project and build recognition
>>
>> Cons
>>
>>    1. Real deep integration might be a lot of work (we get an initial
>>    version very fast, but integrating details, e.g. making our UIs work
>>    in Zeppelin, or getting Zeppelin to post to UES, might be tricky).
>>    2. Zeppelin is still in the incubator
>>    3. Need to assess the community
>>
>> I suggest you guys have a detailed chat with MiyuruD, who looked at it in
>> detail, try out things, think about it and report back.
>>
>
> +1, we'll work with Miyuru also and see how to go forward.
>
>>
>> On Thu, Nov 26, 2015 at 3:12 AM, Anjana Fernando <[email protected]> wrote:
>>
>>> Hi Srinath,
>>>
>>> The story looks good. For the part about "user can play with the
>>> data interactively", to make it more functional, we should probably
>>> consider adding Scala scripts to the mix, rather than only having
>>> Spark SQL. Spark SQL may be limited in functionality for certain data
>>> operations, and with Scala, we should be able to use all the functionality
>>> of Spark. For example, it would be easier to integrate ML operations with
>>> other batch operations etc. to create a more natural flow of operations.
>>> The implementation may be tricky though, considering clustering,
>>> multi-tenancy etc.
>>>
>> Let's keep the Scala version post MVP.
>>
>
> Sure.
>
>>> Also, I would like to bring up the question: are most batch jobs
>>> actually meant to be scheduled repeatedly, for a data set that
>>> always grows? Or is it mostly a case where we execute
>>> something once, get the results, and that's it? Maybe this is a different
>>> discussion though. But for scheduled batch jobs, I guess
>>> incremental processing would be critical, which no one seems to bother
>>> about that much.
>>>
>> I think it is mostly scheduled batches as we have. Shall we take this up
>> in a different thread?
>>
>
> Yep, sure.
>
>>> Cheers,
>>> Anjana.
>>>
>>> On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I tried to write down the use cases, to start thinking about this
>>>> starting from what we discussed in the meeting. Please comment. The doc is at
>>>> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
>>>> (same content is below).
>>>>
>>>> Thanks
>>>> Srinath
>>>>
>>>> Batch, interactive, and Predictive Story
>>>>
>>>>    1. Data is uploaded to the system or sent as a data stream and
>>>>    collected for some time (in DAS).
>>>>    2. A Data Scientist comes in, selects a data set, looks at the schema
>>>>    of the data, and computes standard descriptive statistics like Mean,
>>>>    Max, Percentiles and standard deviation about the data.
>>>>    3. Data Scientist cleans up the data using a series of transformations.
>>>>    This might include combining multiple data sets into one data set.
>>>>    [Notebooks]
>>>>    4. He can play with the data interactively.
>>>>    5. He visualizes the data in several ways. [Notebooks]
>>>>    6. If he needs descriptive statistics, he can export the data mutations
>>>>    in the notebooks as a script and schedule it.
>>>>    7. If what he needs is machine learning, he can initialize and run the
>>>>    ML Wizard from the Notebooks and create a model.
>>>>    8. He can export the model he created and any data mutation operations
>>>>    he did as a script and deploy both the model and data mutation
>>>>    operations in the CEP (Realtime Pipeline). This is the actual
>>>>    transaction flow.
>>>>    9. He can export the data mutation operations and machine learning
>>>>    model building logic as a script and schedule it to run periodically.
>>>>    This is the
>>>>
>>>> [image: NotebookPipeline.png]
>>>>
>>>> Realtime Story
>>>>
>>>> For the realtime story also, we can start with a data set, write realtime
>>>> queries, test them by replaying the data, and only then deploy the queries.
>>>> (We do this even now.) We can do the same.
>>>>
>>>>    1. User starts with a dataset.
>>>>    2. He writes a set of queries using the dataset as a stream. Streams
>>>>    and datasets share the same record format. For example, consider the
>>>>    following data set.
>>>>
>>>> We can consider this as a batch data set by taking it as a whole, or as
>>>> a stream by taking it record by record.
>>>>
>>>> For example, if we run the query
>>>>
>>>> select * from CountryData where GDP>35000
>>>>
>>>> it will provide the following results.
>>>>
>>>>    1. Tables created by replaying data with CEP queries can be visualized
>>>>    like other data (except that time is special).
>>>>    2. When the Data Scientist is happy, he can click a button and
>>>>    export the CEP queries as an execution plan and any charts as realtime
>>>>    gadgets. (One complication is that time is special, and we need to
>>>>    transform from any visualization to a time based visualization.)
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://people.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>> --
>>> *Anjana Fernando*
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>

--
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
