[
https://issues.apache.org/jira/browse/BIGTOP-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040689#comment-14040689
]
bhashit parikh commented on BIGTOP-1272:
----------------------------------------
After giving the whole process a lot of thought, I have settled on the
following approach:
# Use {{hadoop}} java API to write customer records (TSV). Each record will
contain (id, firstName, lastName, state).
# Use the same approach for writing a TSV containing details of the product
(id, name, price). We could have skipped this step but for the fact that
{{mahout}} will require product ids to process the data.
# Once these two sets of base records are generated, generate transaction
records containing the customer ids and products.
# The generated transaction records will simulate real world buying patterns in
that the customers who generally buy dog products will buy dog products most of
the time, but once in a while they may buy some other product as well. However,
the frequency of that happening will be very low.
# The weight currently given to each state will be used for generating customer
records so that we have a larger number of customers from states with a higher
weight.
# While generating the transaction records, we will need to access the customer
ids (which are randomly generated). Use {{pig}} to read the customer ids from
the customer records file.
# Create a new {{pig}} script to translate the transaction records to the
format required by {{mahout}} recommender.
# Perform parallel ALS recommendation.
# At this stage, the recommendations are done. However, they will be in a
format like {{1 100 102}}. In order to make them more readable, we can run some
{{pig}} code that generates another file by reading from both the transaction
records and the output of {{mahout}} recommender to generate some output like
{{id:1, bought: dog_collar,dog_food, recommended: dog_leash}}, or something to
that effect. [~jayunit100] what do you think?
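
To make the steps above concrete, here is a rough sketch of the generation
logic and the final readable output. It is plain Python rather than the
{{hadoop}} java API and {{pig}} named in the proposal, purely to illustrate the
idea. Everything in it is an assumption for illustration: the state weights,
the product list, the 5% off-pattern rate, and all function names are
hypothetical, and the {{userID,itemID,value}} layout is only assumed to be what
the {{mahout}} recommender expects.

```python
import random

# Hypothetical sample data; the real state weights and products live in BigPetStore.
STATE_WEIGHTS = {"CA": 0.5, "TX": 0.3, "AK": 0.2}
PRODUCTS = [  # (id, name, price, category); category is an assumed helper field
    (100, "dog_food", 20.0, "dog"),
    (101, "dog_collar", 8.0, "dog"),
    (102, "dog_leash", 12.0, "dog"),
    (200, "cat_food", 18.0, "cat"),
    (201, "cat_litter", 10.0, "cat"),
]
OFF_PATTERN_RATE = 0.05  # assumed chance of buying outside the usual category


def make_customers(n, rng):
    """Return (id, firstName, lastName, state) tuples with weighted states."""
    states = list(STATE_WEIGHTS)
    weights = [STATE_WEIGHTS[s] for s in states]
    ids = rng.sample(range(1, 1_000_000), n)  # random, unique customer ids
    return [
        (cid, f"first{cid}", f"last{cid}", rng.choices(states, weights=weights)[0])
        for cid in ids
    ]


def make_transactions(customers, per_customer, rng):
    """Return (customerId, productId) pairs biased toward one category each."""
    categories = sorted({p[3] for p in PRODUCTS})
    txns = []
    for cid, _first, _last, _state in customers:
        favorite = rng.choice(categories)
        for _ in range(per_customer):
            # Rarely, pick from outside the customer's usual category.
            off_pattern = rng.random() < OFF_PATTERN_RATE
            pool = [p for p in PRODUCTS if (p[3] != favorite) == off_pattern]
            txns.append((cid, rng.choice(pool)[0]))
    return txns


def to_tsv(rows):
    """Render rows as tab-separated lines."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


def to_mahout_input(txns):
    """Collapse transactions into userID,itemID,count lines (assumed layout)."""
    counts = {}
    for pair in txns:
        counts[pair] = counts.get(pair, 0) + 1
    return [f"{u},{i},{c}" for (u, i), c in sorted(counts.items())]


def describe(cid, txns, recommended_ids):
    """Join purchases and recommendations into one readable line."""
    names = {p[0]: p[1] for p in PRODUCTS}
    bought = sorted({names[i] for u, i in txns if u == cid})
    recommended = [names[i] for i in recommended_ids]
    return f"id:{cid}, bought: {','.join(bought)}, recommended: {','.join(recommended)}"
```

For example, {{describe(1, [(1, 100), (1, 101)], [102])}} returns
{{id:1, bought: dog_collar,dog_food, recommended: dog_leash}}, which matches the
shape of the readable output suggested in the last step.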
> BigPetStore: Productionize the Mahout recommender
> -------------------------------------------------
>
> Key: BIGTOP-1272
> URL: https://issues.apache.org/jira/browse/BIGTOP-1272
> Project: Bigtop
> Issue Type: New Feature
> Components: Blueprints
> Affects Versions: backlog
> Reporter: jay vyas
> Attachments: arch.jpeg
>
>
> BIGTOP-1271 adds patterns into the data that guarantee that a meaningful
> type of product recommendation can be given for at least *some* customers,
> since we know that there are going to be many customers who only bought 1
> product, and also customers that bought 2 or more products -- even in a
> dataset size of 10, due to the gaussian distribution of purchases that is
> also in the dataset generator.
> The current mahout recommender code is statically valid: it runs to
> completion in local unit tests if a hadoop 1x tarball is present, but
> otherwise it hasn't been tested at scale. So, let's get it working. This
> JIRA will also comprise:
> - deciding whether to use mahout 2x for unit tests (default on mahout maven
> repo is the 1x impl) and whether or not bigtop should host a mahout 2x jar.
> After all, bigtop builds a mahout 2x jar as part of its packaging process,
> and BigPetStore might thus need a mahout 2x jar in order to test against the
> right set of bigtop releases.