[ 
https://issues.apache.org/jira/browse/BIGTOP-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013795#comment-14013795
 ] 

jay vyas commented on BIGTOP-1272:
----------------------------------

[~bhashit] are you interested in picking this up?  Here's an outline of what I 
think we should do.  Others are welcome to chime in, of course.

1) Add an external mahout hadoop-2 repo, so building isn't required.  For 
example, HDP's open maven repos host artifacts compiled for 2.x.

2) Add back in the mahout recommender and write an integration test (one like 
the one we have for pig).  To do this, we will need to 
- create a mock input file of integer,integer,1|0 tuples.  The mock file should 
contain similar "users" (column 1), for example: 
{noformat}
1,100,1
2,100,1
2,200,1
{noformat}
In the above, user "2" is similar to user "1" (they both like product "100").  
So we would like to see that user 1 gets a recommendation to buy "200" in the 
output. 
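To sanity-check what any recommender should produce on this mock data, a tiny 
co-occurrence sketch in plain Java (no mahout; the class name and hardcoded 
tuples are just for illustration) confirms that user 1 should be recommended 
product 200:

```java
import java.util.*;

public class MockRecommendCheck {
    // user,product,liked tuples mirroring the mock input above
    static final int[][] TUPLES = {{1, 100, 1}, {2, 100, 1}, {2, 200, 1}};

    static Set<Integer> recommend(int user) {
        // map each user to the set of products they liked
        Map<Integer, Set<Integer>> likes = new HashMap<>();
        for (int[] t : TUPLES)
            if (t[2] == 1)
                likes.computeIfAbsent(t[0], k -> new HashSet<>()).add(t[1]);

        // recommend products liked by any user who shares a product with us
        Set<Integer> mine = likes.getOrDefault(user, Collections.<Integer>emptySet());
        Set<Integer> recs = new TreeSet<>();
        for (Map.Entry<Integer, Set<Integer>> e : likes.entrySet())
            if (e.getKey() != user && !Collections.disjoint(mine, e.getValue()))
                recs.addAll(e.getValue());
        recs.removeAll(mine);  // don't re-recommend what the user already has
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(recommend(1)); // prints [200]
    }
}
```

The ALS output won't match this exactly (it produces scored factorized 
recommendations), but the integration test's expected result is the same: 200 
shows up for user 1.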

- write the java code to call the mahout parallel ALS job directly via the API, 
taking the above mock file as input.
- tune the parameters so that, even with a small number of records, some 
recommendations are still made (i.e. so that integration tests can run fast, 
and locally). 
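As a rough sketch of that driver: the flag names below follow mahout 0.x's 
ParallelALSFactorizationJob and the als RecommenderJob and should be 
double-checked against the version we pin; the paths and the small parameter 
values are placeholder assumptions chosen so a tiny input factorizes quickly:

```java
import java.util.Arrays;
import java.util.List;

public class BpsAlsDriver {

    // Hypothetical local paths; small numFeatures/numIterations so the
    // three-row mock input still produces recommendations fast.
    static List<String> factorizeArgs() {
        return Arrays.asList(
            "--input", "mock-ratings.csv",
            "--output", "als/factorization",
            "--numFeatures", "2",
            "--numIterations", "10",
            "--lambda", "0.065");
    }

    static List<String> recommendArgs() {
        return Arrays.asList(
            "--userFeatures", "als/factorization/U",
            "--itemFeatures", "als/factorization/M",
            "--numRecommendations", "1",
            "--maxRating", "1",   // ratings are 0/1 in the mock file
            "--output", "als/recommendations");
    }

    public static void main(String[] args) {
        // On a real cluster these would be handed to mahout via ToolRunner, e.g.:
        // ToolRunner.run(new ParallelALSFactorizationJob(), factorizeArgs().toArray(new String[0]));
        // ToolRunner.run(new RecommenderJob(), recommendArgs().toArray(new String[0]));
        System.out.println(factorizeArgs());
        System.out.println(recommendArgs());
    }
}
```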

After that point, the "prototyping" will be done - and we can move forward with 

3) Embedding user types in the data (i.e. BIGTOP-1271).  That will mean that 
the data produced by the data set generator has meaningful user trends which 
can be used as input to the recommender.  I like the idea of using scala to 
redo it, as you showed me offline in http://pastebin.com/wHXCEuk4

4) Create a new pig script "BPS_transactions.pig" (like BPS_Analytics.pig) to 
output a 3-column hashcode file which we will use as the "real" input to 
mahout in the actual integration tests / cluster.  Maybe with a python udf for 
hashing products and users.  I will provide that as a patch in this JIRA, and 
we can add it to the overall JIRA when you finish 1-3.  This will allow us 
to keep bigpetstore moving forward in spite of the hive issues (see 
BIGTOP-1270 / HIVE-7115 on why hive is difficult to run in bigpetstore at the 
moment).  
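The core of the hashing idea is just a stable, non-negative integer id for 
each user and product string; a sketch in plain Java (the pig/jython UDF 
wiring is omitted, and the method name is an assumption):

```java
public class IdHasher {

    // Stable, non-negative int id for a user or product string.
    // Mahout's ALS input wants integer ids, so mask off the sign bit.
    // Hash collisions are possible but acceptable for test data.
    public static int hashId(String raw) {
        return raw.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        // e.g. turn a (user, product, liked) record into mahout-ready ints
        System.out.println(hashId("jay") + "," + hashId("dog-food") + ",1");
    }
}
```

A python udf in the pig script would do the same thing; the point is only that 
the mapping be deterministic across runs so results are reproducible.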

5) Match this up with BIGTOP-1327 (the updated arch.dot diagram) to ensure the 
architecture is represented correctly. 

6) Update arch.dot with the exact commands (i.e. "hadoop jar 
bps.jar BPSRecommender -in ... -out ...")

At that point, we will do some testing of the bigpetstore jar file, based on 6, 
in the cluster, and then commit the next iteration of bigpetstore !

> BigPetStore: Productionize the Mahout recommender
> -------------------------------------------------
>
>                 Key: BIGTOP-1272
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1272
>             Project: Bigtop
>          Issue Type: New Feature
>          Components: Blueprints
>    Affects Versions: backlog
>            Reporter: jay vyas
>
> BIGTOP-1271 adds patterns into the data that guarantee that a meaningful 
> type of product recommendation can be given for at least *some* customers, 
> since we know that there are going to be many customers who only bought 1 
> product, and also customers that bought 2 or more products -- even in a 
> dataset size of 10, due to the gaussian distribution of purchases that is 
> also in the dataset generator. 
> The current mahout recommender code is statically valid: it runs to 
> completion in local unit tests if a hadoop 1x tarball is present, but 
> otherwise it hasn't been tested at scale.  So, let's get it working.  This 
> JIRA will also comprise:
> - deciding whether to use mahout 2x for unit tests (the default on the 
> mahout maven repo is the 1x impl) and whether or not bigtop should host a 
> mahout 2x jar.  After all, bigtop builds a mahout 2x jar as part of its 
> packaging process, and BigPetStore might thus need a mahout 2x jar in order 
> to test against the right set of bigtop releases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
