[ 
https://issues.apache.org/jira/browse/BIGTOP-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated BIGTOP-1089:
-----------------------------

    Comment: was deleted

(was: The need for templates for big data processing pipelines is clear. Given
the increasing overlap across big data and NoSQL projects, such a template
would also provide a ground truth for comparing the behaviour and approach of
different tools applied to a common, easily understood problem.

This ticket formalizes the conversation in mailing list archives regarding the 
BigPetStore proposal. 

At the moment, with the exception of word count, there are very few examples
of big data problems that have been solved by a variety of different
technologies. And even for word count, there aren't a lot of templates that
can be customized for applications.

By comparison, other application developer communities (e.g. the Rails folks,
those using maven archetypes, etc.) have a plethora of template applications
which can be used to kickstart their applications and use cases.

This big pet store JIRA thus aims to do the following: 

0) Curate a single, central, standard input data set. (Modified: generate a
large input data set on the fly.)

1) Define a big data processing pipeline (using the pet store theme - except 
morphing it to be analytics rather than transaction oriented), and implement 
basic aggregations in hive, pig, etc...

2) Sink the results of step 1 into some kind of NoSQL store or search engine.
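
As a rough illustration of steps 0 and 1 above, here is a minimal Python
sketch that generates a reproducible input data set on the fly and runs one
basic aggregation over it. The record layout (id,state,product,price), the
product and state lists, and the function names are all assumptions made for
this example; the ticket does not define a schema, and a real implementation
would use hive, pig, etc. rather than plain Python.

```python
import random
from collections import Counter

# Hypothetical product and state lists; the ticket does not define a schema.
PRODUCTS = ["dog-food", "cat-litter", "fish-tank", "bird-seed"]
STATES = ["AZ", "CA", "CT", "NY"]


def generate_transactions(n, seed=0):
    """Yield n synthetic purchase records as CSV lines: id,state,product,price."""
    rng = random.Random(seed)  # seeded so every run yields the same data set
    for i in range(n):
        state = rng.choice(STATES)
        product = rng.choice(PRODUCTS)
        price = rng.uniform(1.0, 100.0)
        yield f"{i},{state},{product},{price:.2f}"


def count_by_product(lines):
    """Basic aggregation: number of purchases per product."""
    counts = Counter()
    for line in lines:
        _id, _state, product, _price = line.split(",")
        counts[product] += 1
    return counts


if __name__ == "__main__":
    records = list(generate_transactions(1000))
    for product, count in sorted(count_by_product(records).items()):
        print(product, count)
```

Seeding the generator keeps the data set "standard" in the sense of step 0:
every tool in the comparison can regenerate identical input without shipping
a large file.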
 
Some implementation details (open to change; please comment/review):

- initial data source will be raw text or (better yet) some kind of 
automatically generated data.
- the source will initially go in bigtop/blueprints
- the application sources can be in any modern JVM language 
(java,scala,groovy,clojure), since bigtop supports scala, java, groovy natively 
already and clojure is easy to support with the right jars.  
- each "job" will be named according to the corresponding DAG of the big data
pipeline.
- all jobs should (not sure if this is a requirement?) be controlled by a
global program (maybe oozie?) which runs the tasks in order, and can easily be
customized to use different tools at different stages.
- for now, all outputs will go to files, so that users don't need servers to
run the app.
- final data sinks will be into a highly available transaction oriented store 
(solr/hbase/...)
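
The idea of a global program that runs file-based stages in order, with the
tool at each stage swappable, might look roughly like the Python sketch below.
This is only an illustration under stated assumptions: the stage functions,
file names, and the run_pipeline helper are hypothetical, and it stands in for
whatever workflow tool (oozie or otherwise) ends up being used.

```python
import os
import tempfile


def generate(out_dir):
    """Stage 1 (hypothetical): write a tiny raw-text input file."""
    path = os.path.join(out_dir, "generate.txt")
    with open(path, "w") as f:
        f.write("dog\ncat\ndog\n")
    return path


def aggregate_plain(in_path, out_dir):
    """Stage 2, one possible implementation: count lines in pure Python."""
    counts = {}
    with open(in_path) as f:
        for line in f:
            key = line.strip()
            counts[key] = counts.get(key, 0) + 1
    path = os.path.join(out_dir, "aggregate.txt")
    with open(path, "w") as f:
        for key in sorted(counts):
            f.write(f"{key}\t{counts[key]}\n")
    return path


def run_pipeline(stages, out_dir):
    """Run stages in order; each stage reads the previous stage's output file."""
    previous = None
    for stage in stages:
        if previous is None:
            previous = stage(out_dir)  # first stage has no input file
        else:
            previous = stage(previous, out_dir)
    return previous


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    final_path = run_pipeline([generate, aggregate_plain], work_dir)
    with open(final_path) as f:
        print(f.read(), end="")
```

Because every stage only reads and writes files, replacing aggregate_plain
with a function that shells out to another tool would not change the driver,
as long as it keeps the same (in_path, out_dir) signature.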

This ticket will be completed once a first iteration of BigPetStore is complete 
using 3 ecosystem components, along with a depiction of the pipeline which can 
be used for development.

I've assigned this to myself :) I hope that's okay? It seems that at the
moment I'm the only one working on it.
)

> BigPetStore: A bigtop blueprint project inside of bigtop
> --------------------------------------------------------
>
>                 Key: BIGTOP-1089
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1089
>             Project: Bigtop
>          Issue Type: New Feature
>          Components: Blueprints
>            Reporter: jay vyas
>
> The need for templates for big data processing pipelines is clear. Given
> the increasing overlap across big data and NoSQL projects, such a template
> would also provide a ground truth for comparing the behaviour and approach
> of different tools applied to a common, easily understood problem.
> This ticket formalizes the conversation in mailing list archives regarding 
> the BigPetStore proposal. 
> At the moment, with the exception of word count, there are very few
> examples of big data problems that have been solved by a variety of
> different technologies. And even for word count, there aren't a lot of
> templates that can be customized for applications.
> By comparison, other application developer communities (e.g. the Rails
> folks, those using maven archetypes, etc.) have a plethora of template
> applications which can be used to kickstart their applications and use cases.
> This big pet store JIRA thus aims to do the following: 
> 0) Curate a single, central, standard input data set.
> 1) Define a big data processing pipeline (using the pet store theme - except 
> morphing it to be analytics rather than transaction oriented). 
> 2) Define specific nodes as a DAG in the pipeline. For example, the first
> step would be to ETL some text files into the Hadoop cluster's default
> FileSystem. Each task would have multiple implementations, using different
> components of the bigtop stack. Using word count as the example: we would
> have word count implementations in pig, hive, mapreduce, spark, etc., and
> unit tests which confirm that each one produces exactly the same output.
> 3) Package the project as a maven archetype that can be used to create new 
> java based bigdata processor frameworks. 
> Some implementation details (open to change; please comment/review):
> - initial data source will be raw text or (better yet) some kind of 
> automatically generated data.
> - the source will initially go in bigtop/blueprints
> - the application sources can be in any modern JVM language 
> (java,scala,groovy,clojure), since bigtop supports scala, java, groovy 
> natively already and clojure is easy to support with the right jars.  
> - each "job" will be named according to the corresponding DAG of the big data
> pipeline.
> - all jobs will be controlled by a global program (maybe oozie?) which runs
> the tasks in order, and can easily be customized to use different tools at
> different stages.
> - for now, all outputs will go to files, so that users don't need servers
> to run the app.
> - final data sinks will be into a highly available transaction oriented store 
> (solr/hbase/riak/...)
> This ticket will be completed once a first iteration of BigPetStore is 
> complete using 3 ecosystem components, along with a depiction of the pipeline 
> which can be used for development.
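
The description above asks for several word count implementations plus unit
tests confirming that each produces exactly the same output. A minimal sketch
of that check, in Python rather than pig/hive/mapreduce/spark, could look like
the following. Both implementations and the check_same_output helper are
hypothetical stand-ins; in the real project each implementation would be a job
run on a different bigtop component and the comparison would be over its
output files.

```python
from collections import Counter
from itertools import groupby


def wordcount_counter(lines):
    """Implementation A: count words with a hash-based Counter."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def wordcount_sort_group(lines):
    """Implementation B: sort all words, then count each run of equal words.

    This loosely mirrors the map, sort, reduce shape of a MapReduce job.
    """
    words = sorted(word for line in lines for word in line.split())
    return {word: sum(1 for _ in group) for word, group in groupby(words)}


def check_same_output(implementations, lines):
    """Return True if every implementation yields the same result on lines.

    lines must be a reusable sequence (e.g. a list), not a one-shot iterator.
    """
    results = [impl(lines) for impl in implementations]
    return all(result == results[0] for result in results)


if __name__ == "__main__":
    sample = ["the cat and the dog", "the fish"]
    print(check_same_output([wordcount_counter, wordcount_sort_group], sample))
```

Comparing results as dictionaries ignores output ordering, which matters here
because different tools generally do not emit keys in the same order.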



--
This message was sent by Atlassian JIRA
(v6.1#6144)
