[
https://issues.apache.org/jira/browse/BIGTOP-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013795#comment-14013795
]
jay vyas commented on BIGTOP-1272:
----------------------------------
[~bhashit] are you interested in picking this up? Here's an outline of what I
think we should do. Others are welcome to chime in, of course.
1) Add an external Mahout 2.x repo so that building isn't required. For example,
the HDP servers host open Maven repos with artifacts compiled for 2.x.
2) Add the Mahout recommender back in and write an integration test (one like
we have for Pig). To do this, we will need to:
- create a mock input file of integer,integer,1|0 tuples. The mock file should
have similar "users" (column 1), for example:
{noformat}
1,100,1
2,100,1
2,200,1
{noformat}
In the above, user "2" is similar to user "1" (they both like product "100"), so
we would like to see that "1" gets a recommendation to buy "200" in the output.
- write the Java code to call the Mahout parallel ALS job directly via the API,
taking the above mock file as input (a rough sketch follows after this list).
- tune the parameters so that, even with a small number of records, some
recommendations are still made (i.e. so that integration tests can run fast,
but locally).
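As a starting point for the driver, something like the following might work.
This is a minimal sketch, assuming we call Mahout's ParallelALSFactorizationJob
from org.apache.mahout.cf.taste.hadoop.als; the class name
BPSRecommenderPrototype, the paths, and the parameter values are placeholders
of mine, and the flags should be double-checked against whatever Mahout
artifact we pin in (1).
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

public class BPSRecommenderPrototype {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Factorize the mock user,item,rating CSV into user/item feature matrices.
    int exitCode = ToolRunner.run(conf, new ParallelALSFactorizationJob(), new String[] {
        "--input", "target/mock-transactions.csv",  // the 1,100,1 style file above
        "--output", "target/als-factorization",
        "--tempDir", "target/als-tmp",
        "--numFeatures", "2",       // keep tiny so the local test finishes quickly
        "--numIterations", "5",
        "--lambda", "0.065"
    });
    // A second step (org.apache.mahout.cf.taste.hadoop.als.RecommenderJob) would
    // then produce the top-N recommendations from the factorization output.
    System.exit(exitCode);
  }
}
{noformat}
The integration test would then assert that user "1" gets product "200" somewhere
in the recommendation output, per the mock data above.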
After that point, the "prototyping" will be done - and we can move forward with
3) Embedding user types in the data (i.e. BIGTOP-1271). That will mean that the
data produced by the data set generator has meaningful user trends which can be
used as input to the recommender. I like the idea of using Scala to redo it, as
you showed me offline in http://pastebin.com/wHXCEuk4
4) Create a new Pig script, "BPS_transactions.pig" (like BPS_Analytics.pig), to
output a 3 column hashcode file which we will use as the "real" input to Mahout
in the actual integration tests / cluster, maybe with a Python UDF for the
hashing of products and users (a sketch of such a UDF is below). I will provide
that as a patch in this JIRA, and we can add it in the overall JIRA when you
finish 1-3. This will allow us to keep BigPetStore moving forward in spite of
the Hive issues (see BIGTOP-1270 / HIVE-7115 on why Hive is difficult to run in
BigPetStore at the moment).
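For the hashing UDF in (4), a Python UDF is fine; purely as an illustration of
the hashing we need (stable integer ids that Mahout can consume), here is what a
hypothetical Java EvalFunc version might look like. The class name and the
sign-bit masking are my assumptions, not existing BigPetStore code.
{noformat}
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical Pig UDF: maps a product or user string to a stable, non-negative int id.
public class HashId extends EvalFunc<Integer> {

  @Override
  public Integer exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    // Mask off the sign bit so the recommender only ever sees non-negative ids.
    return input.get(0).toString().hashCode() & Integer.MAX_VALUE;
  }
}
{noformat}
In the Pig script this would be registered with REGISTER / DEFINE and applied to
the customer and product columns to produce the 3 column hashcode file.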
5) Match this up with BIGTOP-1327 (the updated arch.dot diagram) to ensure that
the architecture is described correctly.
6) Update arch.dot with the exact commands (i.e. "hadoop jar bps.jar
BPSRecommender -in ... -out ...").
At that point, we will do some testing of the BigPetStore jar file, based on
(6), in the cluster, and then commit the next iteration of BigPetStore!
> BigPetStore: Productionize the Mahout recommender
> -------------------------------------------------
>
> Key: BIGTOP-1272
> URL: https://issues.apache.org/jira/browse/BIGTOP-1272
> Project: Bigtop
> Issue Type: New Feature
> Components: Blueprints
> Affects Versions: backlog
> Reporter: jay vyas
>
> BIGTOP-1271 adds patterns into the data that guarantee that a meaningful type
> of product recommendation can be given for at least *some* customers, since we
> know that there are going to be many customers who only bought 1 product, and
> also customers that bought 2 or more products -- even in a dataset of size 10,
> due to the Gaussian distribution of purchases that is also in the dataset
> generator.
> The current Mahout recommender code is statically valid: it runs to completion
> in local unit tests if a Hadoop 1.x tarball is present, but otherwise it
> hasn't been tested at scale. So, let's get it working. This JIRA will also
> comprise:
> - deciding whether to use Mahout 2.x for unit tests (the default on the Mahout
> Maven repo is the 1.x impl) and whether or not Bigtop should host a Mahout
> 2.x jar. After all, Bigtop builds a Mahout 2.x jar as part of its packaging
> process, and BigPetStore might thus need a Mahout 2.x jar in order to test
> against the right set of Bigtop releases.
--
This message was sent by Atlassian JIRA
(v6.2#6252)