Hi Srinath,

On Thu, Nov 26, 2015 at 9:08 AM, Srinath Perera <[email protected]> wrote:
> Hi Anjana,
>
> Great!! I think the next step is deciding whether we do this with
> Zeppelin or we build it from scratch.
>
> Pros of Zeppelin
>
> 1. We get a lot of features OOB
> 2. Code is maintained by the community, patches etc.
> 3. New features will get added and it will evolve
> 4. We get to contribute to an Apache project and build recognition
>
> Cons
>
> 1. Really deep integration might be a lot of work (we get an initial
> version very fast, but integrating the details, e.g. making our UIs work
> in Zeppelin, or getting Zeppelin to post to UES, might be tricky).
> 2. Zeppelin is still in the incubator
> 3. Need to assess the community
>
> I suggest you guys have a detailed chat with MiyuruD, who looked at it
> in detail, try things out, think about it, and report back.
>
+1, we'll work with Miyuru also and see how to go forward.

> On Thu, Nov 26, 2015 at 3:12 AM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Srinath,
>>
>> The story looks good. For the part about "the user can play with the
>> data interactively", to make it more functional we should probably
>> consider bringing Scala scripts into the mix, rather than only having
>> Spark SQL. Spark SQL may be limited in functionality for certain data
>> operations, and with Scala we should be able to use the full
>> functionality of Spark. For example, it would be easier to integrate ML
>> operations with other batch operations to create a more natural flow of
>> operations. The implementation may be tricky though, considering
>> clustering, multi-tenancy, etc.
>>
> Let's keep the Scala version for post-MVP.
>
Sure.

>> Also, I would like to bring up the question: are most batch jobs
>> actually meant to be scheduled repeatedly like that, over a data set
>> that keeps growing? Or is it mostly a case where we execute something
>> once, get the results, and that's it? Maybe this is a different
>> discussion though.
>> But for scheduled batch jobs like that, I guess incremental processing
>> would be critical, which no one seems to bother much about, though.
>>
> I think it is mostly scheduled batches as we have. Shall we take this up
> in a different thread?
>
Yep, sure.

>>
>> Cheers,
>> Anjana.
>>
>> On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I tried to write down the use cases, to start thinking about this,
>>> starting from what we discussed in the meeting. Please comment. (The
>>> doc is at
>>> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
>>> ; the same content is below.)
>>>
>>> Thanks,
>>> Srinath
>>>
>>> Batch, Interactive, and Predictive Story
>>>
>>> 1. Data is uploaded to the system, or sent as a data stream and
>>> collected for some time (in DAS).
>>> 2. A data scientist comes in, selects a data set, looks at the schema
>>> of the data, and computes standard descriptive statistics about the
>>> data, such as mean, max, percentiles, and standard deviation.
>>> 3. The data scientist cleans up the data using a series of
>>> transformations. This might include combining multiple data sets into
>>> one. [Notebooks]
>>> 4. He can play with the data interactively.
>>> 5. He visualizes the data in several ways. [Notebooks]
>>> 6. If he needs descriptive statistics, he can export the data
>>> mutations in the notebooks as a script and schedule it.
>>> 7. If what he needs is machine learning, he can initialize and run
>>> the ML Wizard from the notebooks and create a model.
>>> 8. He can export the model he created and any data mutation
>>> operations he did as a script, and deploy both the model and the data
>>> mutation operations in the CEP (realtime pipeline). This is the actual
>>> transaction flow.
>>> 9. He can export the data mutation operations and the machine
>>> learning model-building logic as a script and schedule it to run
>>> periodically.
>>> This is the
>>>
>>> [image: NotebookPipeline.png]
>>>
>>> Realtime Story
>>>
>>> The realtime story can also start with a data set: we write realtime
>>> queries, test them by replaying the data, and only then deploy the
>>> queries. (We do this even now.) We can do the same here.
>>>
>>> 1. The user starts with a dataset.
>>> 2. He writes a set of queries using the dataset as a stream. Streams
>>> and datasets share the same record format. For example, consider the
>>> following data set.
>>>
>>> We can consider this as a batch data set by taking it as a whole, or
>>> as a stream by taking it record by record.
>>>
>>> For example, if we run the query
>>>
>>> select * from CountryData where GDP > 35000
>>>
>>> it will provide the following results.
>>>
>>> 3. Tables created by replaying data through CEP queries can be
>>> visualized like any other data (except that time is special).
>>> 4. When the data scientist is happy, he can click a button and export
>>> the CEP queries as an execution plan, and any charts as realtime
>>> gadgets. (One complication is that time is special, and we need to
>>> transform from any visualization to a time-based visualization.)
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter: @srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
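To make the batch-vs-stream duality above concrete, here is a minimal sketch in plain Python (not the actual Spark SQL or CEP/Siddhi runtimes) of how the same record set can be queried as a whole table or replayed record by record. The CountryData rows and GDP figures are invented sample values for illustration, since the original data set table is not included in the mail.

```python
# Hypothetical sample rows standing in for the CountryData set in the mail.
COUNTRY_DATA = [
    {"country": "USA", "gdp": 56000},
    {"country": "Sri Lanka", "gdp": 11000},
    {"country": "Switzerland", "gdp": 82000},
    {"country": "India", "gdp": 6200},
]

def batch_query(rows):
    """Batch view: evaluate the whole table at once, analogous to
    `select * from CountryData where GDP > 35000`."""
    return [r for r in rows if r["gdp"] > 35000]

def stream_query(rows):
    """Stream view: replay the same rows one by one, emitting each
    match as it arrives (roughly what a CEP filter query does)."""
    for r in rows:
        if r["gdp"] > 35000:
            yield r

batch_result = batch_query(COUNTRY_DATA)
stream_result = list(stream_query(COUNTRY_DATA))

# Both views select the same records; only the delivery model differs.
assert batch_result == stream_result
print([r["country"] for r in batch_result])  # ['USA', 'Switzerland']
```

Because streams and datasets share the same record format, the same filter logic can be tested by replay against the stored set before the query is deployed against the live stream.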
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
