[
https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118773#comment-14118773
]
jay vyas edited comment on BIGTOP-1366 at 9/2/14 9:21 PM:
----------------------------------------------------------
Hi Jorn. This has been done by RJ, but not implemented in MapReduce or Spark
yet.
Should we consider *implementing this in Spark as a prerequisite* to
BIGTOP-1414, and let the Spark code generate, and process, its own data?
An implementation already exists in Python. We have tested it quite
thoroughly, and it does something almost identical
to what you have mentioned here.
You can check out the diagram of the data model here:
https://github.com/rnowling/bigpetstore-data-generator/raw/master/bdcloud_paper/latex/paper.pdf
.
The Python code is also in that repository (the code can be easily
translated to Scala; I can help or at least get this started if need be).
was (Author: jayunit100):
Hi Jorn. This has been done by RJ, but not implemented in MapReduce or Spark
yet.
Should we consider *implementing this in Spark as a prerequisite* to
BIGTOP-1414, and let the Spark code generate, and process, its own data?
An implementation already exists in Python. We have tested it quite
thoroughly, and it does something almost identical
to what you have mentioned here.
You can check out the implementation here:
https://github.com/rnowling/bigpetstore-data-generator/raw/master/bdcloud_paper/latex/paper.pdf
.
(The code can be easily translated to Scala; I can help or at least get this
started if need be.)
> Updated, Richer Model for Generating Data for BigPetStore
> ----------------------------------------------------------
>
> Key: BIGTOP-1366
> URL: https://issues.apache.org/jira/browse/BIGTOP-1366
> Project: Bigtop
> Issue Type: Improvement
> Components: blueprints
> Affects Versions: backlog
> Reporter: RJ Nowling
> Assignee: RJ Nowling
> Priority: Minor
> Original Estimate: 8,736h
> Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow. BPS's current
> model for generating customer data is sufficient for basic testing of the
> Hadoop ecosystem, **but the model is very basic and lacks sufficient
> complexity for embedding interesting patterns into the data**.
> As a result, **more complex, scalable testing such as testing clustering
> algorithms in Mahout on non-trivial data or multidimensional data with
> factors influencing it** is not currently possible.
> Efforts are currently underway to incrementally improve the current model
> (see BIGTOP-1271 and BIGTOP-1272).
> Creating a model that can incorporate **realistic, non-hierarchical
> patterns** and input data to generate rich customer/transaction data with
> interesting correlations will require a re-imagining of the current model and
> its framework.
> To support the improvements to the model in BigPetStore, I have been working
> on an **alternative ab initio model, developed from scratch**. Since the
> development of a new model involves substantial R&D work with more
> specialized tools (mathematical and plotting libraries), I'm doing the
> current work outside of BPS using the IPython Notebook environment. Due to
> the long time frame, the model will be developed on a separate timeline to
> prevent slowing the development of BPS.
> Once the model has stabilized, I will begin incorporating the model into BPS
> itself. One option is to implement the model using Scala for clean
> integration with **Spark**, which is likely to play an increasingly important
> role in the Hadoop ecosystem, and thus will be an important part of
> BigPetStore as a test/blueprint app.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)