[
https://issues.apache.org/jira/browse/BIGTOP-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jay vyas updated BIGTOP-1089:
-----------------------------
Component/s: (was: General)
Blueprints
> BigPetStore: A bigtop blueprint project inside of bigtop
> --------------------------------------------------------
>
> Key: BIGTOP-1089
> URL: https://issues.apache.org/jira/browse/BIGTOP-1089
> Project: Bigtop
> Issue Type: New Feature
> Components: Blueprints
> Reporter: jay vyas
>
> The need for templates for big data processing pipelines is clear. In
> addition, given the increasing overlap across different big data and NoSQL
> projects, such templates will provide a ground truth for comparing the
> behaviour and approach of different tools in solving a common, easily
> comprehended problem.
> This ticket formalizes the conversation in mailing list archives regarding
> the BigPetStore proposal.
> At the moment (with the exception of word count), there are very few
> examples of big data problems that have been solved by a variety of different
> technologies. And even with word count, there aren't a lot of templates that
> can be customized for applications.
> By comparison, other application developer communities (e.g. the Rails folks,
> those using Maven archetypes, etc.) have a plethora of template
> applications which can be used to kickstart their applications and use cases.
>
> This BigPetStore JIRA thus aims to do the following:
> 0) Curate a single, central, standard input data set.
> 1) Define a big data processing pipeline (using the pet store theme, but
> adapted to be analytics-oriented rather than transaction-oriented).
> 2) Define the specific nodes of the pipeline as a DAG. For example, the first
> step would be to ETL some text files into the Hadoop cluster's default
> FileSystem. Each task would have multiple implementations, using different
> components of the Bigtop stack. Using word count as the example: we would
> have word count implementations in Pig, Hive, MapReduce, Spark, etc., and
> unit tests which confirm that each one produces exactly the same output.
> 3) Package the project as a Maven archetype that can be used to create new
> Java-based big data processing frameworks.
> Some implementation details (open to change; please comment/review):
> - initial data source will be raw text or (better yet) some kind of
> automatically generated data.
> - the source will initially go in bigtop/blueprints
> - the application sources can be in any modern JVM language
> (Java, Scala, Groovy, Clojure), since Bigtop already supports Scala, Java,
> and Groovy natively, and Clojure is easy to support with the right jars.
> - each "job" will be named according to its corresponding node in the DAG of
> the big data pipeline.
> - all jobs will be controlled by a global program (maybe Oozie?) which runs
> the tasks in order, and can easily be customized to use different tools at
> different stages.
> - for now, all outputs will be written to files, so that users don't need
> servers to run the app.
> - final data sinks will be a highly available, transaction-oriented store
> (Solr/HBase/Riak/...)
> This ticket will be resolved once a first iteration of BigPetStore is
> complete using three ecosystem components, along with a depiction of the
> pipeline which can be used for development.
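
As a rough illustration of the structure described above (named pipeline
stages, several interchangeable implementations per stage, a controller that
runs the stages in order, and a check that all implementations of a stage
produce the same output), here is a minimal sketch. It is not part of the
proposal: the stage names, the STAGES table, and the functions run_pipeline
and implementations_agree are hypothetical. Plain Python functions stand in
for Pig/Hive/MapReduce/Spark jobs, and a simple linear order stands in for
the DAG and for a workflow tool such as Oozie.

```python
from collections import Counter


def wordcount_counter(lines):
    """Word count using collections.Counter (stand-in for one tool)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def wordcount_manual(lines):
    """Word count using a plain dict (stand-in for a second tool)."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def clean_lowercase(lines):
    """Normalize raw text lines before counting."""
    return [line.lower().strip() for line in lines]


# Hypothetical pipeline definition: ordered stages, each mapping an
# implementation name to a callable. A real pipeline would be a DAG;
# a linear list is used here to keep the sketch short.
STAGES = [
    ("clean", {"lowercase": clean_lowercase}),
    ("wordcount", {"counter": wordcount_counter, "manual": wordcount_manual}),
]


def run_pipeline(data, choices):
    """Run every stage in order, using the implementation named in choices."""
    for stage_name, implementations in STAGES:
        data = implementations[choices[stage_name]](data)
    return data


def implementations_agree(stage_name, sample):
    """Return True if all implementations of a stage give identical output."""
    implementations = dict(STAGES)[stage_name]
    results = [impl(sample) for impl in implementations.values()]
    return all(result == results[0] for result in results)


if __name__ == "__main__":
    raw = ["Dog cat dog", "cat  fish"]
    print(run_pipeline(raw, {"clean": "lowercase", "wordcount": "counter"}))
    print(implementations_agree("wordcount", ["a b a"]))
```

Swapping the tool used at a stage is then a one-word change in the choices
mapping, and implementations_agree plays the role of the unit tests that
confirm each implementation produces exactly the same output.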