[ https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122855#comment-14122855 ]
jay vyas commented on BIGTOP-1366:
----------------------------------

Update: RJ has ported his initial Python-based generator to Java, which will serve as the seed for this:
https://github.com/rnowling/bigpetstore-data-generator/ .

> Updated, Richer Model for Generating Data for BigPetStore
> ----------------------------------------------------------
>
>                 Key: BIGTOP-1366
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1366
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>    Affects Versions: backlog
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>   Original Estimate: 8,736h
>  Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow. BPS's current
> model for generating customer data is sufficient for basic testing of the
> Hadoop ecosystem, **but the model is very basic and lacks sufficient
> complexity for embedding interesting patterns into the data**.
>
> As a result, **more complex, scalable testing such as testing clustering
> algorithms in Mahout on non-trivial data or multidimensional data with
> factors influencing it** is not currently possible.
>
> Efforts are currently underway to incrementally improve the current model
> (see BIGTOP-1271 and BIGTOP-1272).
>
> To create a model that can incorporate **realistic, non-hierarchical
> patterns** and input data to generate rich customer/transaction data with
> interesting correlations will require a re-imagining of the current model and
> its framework.
>
> To support the improvements to the model in BigPetStore, I have been working
> on an **alternative ab initio model, developed from scratch**. Since the
> development of a new model involves substantial R&D work with more
> specialized tools (mathematical and plotting libraries), I'm doing the
> current work outside of BPS using the IPython Notebook environment. Due to
> the long time frame, the model will be developed on a separate timeline to
> prevent slowing the development of BPS.
>
> Once the model has stabilized, I will begin incorporating the model into BPS
> itself. One option is to implement the model using Scala for clean
> integration with **Spark**, which is likely to play an increasingly important
> role in the Hadoop ecosystem, and thus will be an important part of
> BigPetStore as a test/blueprint app.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)