Hi Srinath,

The story looks good. For the part where the "user can play with the data interactively", we should probably consider adding Scala scripts to the mix to make it more functional, rather than having only Spark SQL. Spark SQL may be limited in functionality for certain data operations, whereas with Scala we should be able to use all the functionality of Spark. For example, it would be easier to integrate ML operations with other batch operations, creating a more natural flow of operations. The implementation may be tricky though, considering clustering, multi-tenancy, etc.

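As a rough sketch of the kind of script this would allow: a Spark SQL step and an MLlib step sharing one flow. It assumes the Spark 1.x Scala API (SQLContext, MLlib). The CountryData table name is borrowed from the query in the quoted message; the column names, the sample rows, the object name, and the choice of k-means are all hypothetical and only there for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

// Sketch only: assumes the Spark 1.x Scala API and a local master.
object BatchThenCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-then-cluster").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder rows (country, gdp, population); made-up values, not real figures.
    val countries = sc.parallelize(Seq(
      ("A", 42000.0, 5.0),
      ("B", 38000.0, 60.0),
      ("C", 12000.0, 20.0),
      ("D", 51000.0, 8.0)
    )).toDF("country", "gdp", "population")
    countries.registerTempTable("CountryData")

    // Batch step: plain Spark SQL over the registered table.
    val rich = sqlContext.sql(
      "SELECT gdp, population FROM CountryData WHERE gdp > 35000")

    // ML step in the same script: turn the query result into feature vectors
    // and cluster them with MLlib k-means (k = 2, at most 20 iterations).
    val features = rich.rdd
      .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))
      .cache()
    val model = KMeans.train(features, 2, 20)

    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```

The point is only that the query result feeds the ML step directly, without exporting data between separate tools; how such scripts would be sandboxed per tenant is left open here.
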
Also, I would like to bring up a question: are most batch jobs actually meant to be scheduled repeatedly, for a data set that keeps growing? Or is it mostly a case where we execute something once, get the results, and that's it? Maybe this is a different discussion though. But for scheduled batch jobs, I guess incremental processing would be critical, though no one seems to bother much about it.

Cheers,
Anjana.

On Mon, Nov 23, 2015 at 2:57 PM, Srinath Perera <[email protected]> wrote:

> Hi All,
>
> I tried to write down the use cases, to start thinking about this starting
> from what we discussed in the meeting. Please comment. The doc is at
> https://docs.google.com/document/d/1355YEXbhcd2fvS-zG_CiMigT-iTncxYn3DTHlJRTYyo/edit#
> (the same content is below).
>
> Thanks
> Srinath
>
> Batch, interactive, and Predictive Story
>
> 1. Data is uploaded to the system or sent as a data stream and collected
>    for some time (in DAS).
> 2. A Data Scientist comes in, selects a data set, looks at the schema of the
>    data, and computes standard descriptive statistics such as mean, max,
>    percentiles, and standard deviation.
> 3. The Data Scientist cleans up the data using a series of transformations.
>    This might include combining multiple data sets into one data set.
>    [Notebooks]
> 4. He can play with the data interactively.
> 5. He visualizes the data in several ways. [Notebooks]
> 6. If he needs descriptive statistics, he can export the data mutations in
>    the notebooks as a script and schedule it.
> 7. If what he needs is machine learning, he can initialize and run the ML
>    Wizard from the Notebooks and create a model.
> 8. He can export the model he created and any data mutation operations he
>    did as a script, and deploy both the model and the data mutation
>    operations in the CEP (Realtime Pipeline). This is the actual transaction
>    flow.
> 9.
>    He can export the data mutation operations and machine learning model
>    building logic as a script and schedule it to run periodically. This is the
>
> [image: NotebookPipeline.png]
>
> Realtime Story
>
> For the realtime story too, we can start with a data set, write realtime
> queries, test them by replaying the data, and only then deploy the queries.
> (We do this even now.) We can do the same.
>
> 1. The user starts with a dataset.
> 2. He writes a set of queries using the dataset as a stream. Streams and
>    datasets share the same record format. For example, consider the following
>    data set.
>
> We can consider this as a batch data set by taking it as a whole, or as a
> stream by taking it record by record.
>
> For example, if we run the query
>
> select * from CountryData where GDP>35000
>
> it will provide the following results.
>
> 1. Tables created by replaying data with CEP queries can be visualized like
>    other data (except that time is special).
> 2. When the Data Scientist is happy, he can click a button and export the
>    CEP queries as an execution plan and any charts as realtime gadgets.
>    (One complication is that time is special, and we need to transform
>    from any visualization to a time-based visualization.)
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
