[ https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053688#comment-14053688 ]
jay vyas edited comment on BIGTOP-1366 at 7/7/14 2:18 PM:
----------------------------------------------------------
Thanks RJ. TL;DR:
* RJ is working on making the dataset generation much more sophisticated, and
plans to port it to Scala some day. This is mostly *theoretical work* at the
moment.
* A requirement is that this new model can be used in any paradigm, so *we
will want to decouple the model implementation, if possible, from Spark*.
* This new model will (or at least, *can*) take into account *everything*:
product inventories, customer preferences, possibly even the temperature in
each state, etc., when generating transactions. Thus it can be used to
benchmark machine learning tools in very sparse environments.
Thanks again for doing this. In the interim, I think it would be great if you
could chime in on the primitive models we are currently using. Although they
aren't as advanced as this, with your feedback we will at least be able to
keep placeholders in the code wherever possible to pave the way for things
to come.
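To make the decoupling point concrete, here is a minimal sketch of what an
engine-independent model could look like. It is only an illustration: the
names (Customer, Transaction, generate_transactions) and the weighted-choice
logic are assumptions for the example, not existing BigPetStore code and not
RJ's model. The only point is that the model itself imports nothing from
Spark or Hadoop.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical types for illustration only; not part of BigPetStore.


@dataclass(frozen=True)
class Customer:
    customer_id: int
    state: str
    preferences: Dict[str, float]  # product name -> relative purchase weight


@dataclass(frozen=True)
class Transaction:
    customer_id: int
    product: str


def generate_transactions(customer: Customer, n: int, seed: int) -> List[Transaction]:
    """Generate n transactions for one customer.

    A pure function with no Spark or Hadoop imports, so any execution
    engine (or a plain loop) can call it.
    """
    # Derive a per-customer seed so results are reproducible no matter
    # which worker or partition processes the customer.
    rng = random.Random(seed * 1_000_003 + customer.customer_id)
    products = sorted(customer.preferences)
    weights = [customer.preferences[p] for p in products]
    # Sample products in proportion to the customer's preference weights.
    picks = rng.choices(products, weights=weights, k=n)
    return [Transaction(customer.customer_id, p) for p in picks]
```

In a plain Python loop this would be called once per customer; under Spark
the same function could presumably be passed to flatMap over a collection of
customers, without the model code changing at all.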
was (Author: jayunit100):
Thanks RJ. TL;DR:
* RJ is working on making the dataset generation much more sophisticated, and
plans to port it to Scala some day. This is mostly theoretical work at the
moment.
* A requirement is that this new model can be used in any paradigm, so *we
will want to decouple the model implementation, if possible, from Spark*.
* This new model will take into account everything: product inventories,
customer preferences, possibly even the temperature in each state, etc., when
generating transactions. Thus it can be used to benchmark machine learning
tools in very sparse environments.
Thanks again for doing this. In the interim, I think it would be great if you
could chime in on the primitive models we are currently using. Although they
aren't as advanced as this, with your feedback we will at least be able to
keep placeholders in the code wherever possible to pave the way for things
to come.
> Updated, Richer Model for Generating Data for BigPetStore
> ----------------------------------------------------------
>
> Key: BIGTOP-1366
> URL: https://issues.apache.org/jira/browse/BIGTOP-1366
> Project: Bigtop
> Issue Type: Improvement
> Components: Blueprints
> Affects Versions: backlog
> Reporter: RJ Nowling
> Priority: Minor
> Original Estimate: 8,736h
> Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow. BPS's current
> model for generating customer data is sufficient for basic testing of the
> Hadoop ecosystem, but the model is very basic and lacks sufficient complexity
> for embedding interesting patterns into the data. As a result, more complex
> testing such as testing clustering algorithms in Mahout on non-trivial data
> is not currently possible.
> Efforts are currently underway to incrementally improve the current model
> (see BIGTOP-1271 and BIGTOP-1272). However, creating a model that can
> incorporate realistic patterns and input data to generate rich
> customer/transaction data with interesting correlations will require a
> re-imagining of the current model and its framework.
> To support the improvements to the model in BigPetStore, I have been working
> on an alternative ab initio model, developed from scratch. Since the
> development of a new model involves substantial R&D work with more
> specialized tools (mathematical and plotting libraries), I'm doing the
> current work outside of BPS using the IPython Notebook environment. Due to
> the long time frame, the model will be developed on a separate timeline to
> prevent slowing the development of BPS.
> Once the model has stabilized, I will begin incorporating the model into BPS
> itself. One option is to implement the model in Spark using Scala as a
> foundation for Spark support in BPS.