[ https://issues.apache.org/jira/browse/HAMA-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003167#comment-14003167 ]

Martin Illecker commented on HAMA-904:
--------------------------------------

Thanks for your fast response!

{quote}
1) Why are user and item features broadcasted \[2] by each peer and not shared 
by a global input file on HDFS? (possible performance increase?)
    - Broadcasting happens only once when the BSP job starts (so probably not a 
huge performance increase, even if we fix it somehow).
    - We don't have multiple input formats, which is why I needed to do the 
partitioning in a BSP superstep.
    - Let's say we have 10 GB of user and item features (maybe not realistic) 
and 10 peers. If we use a global file, every peer will probably try to read 
all features (if I understand your statement "global input file on HDFS" 
correctly), which means
    a) 10 GB * 10 of network traffic
    b) filtering logic over 10 GB of data in each peer
    c) memory overhead
    And I don't know an exact way to fetch only the user/item features we are 
interested in. I did it in a superstep because the 10 GB will be split by the 
partitioner and sent to the peers (10 GB of traffic in total) and then each 
peer asks only for the features it is interested in, which I guess uses less 
bandwidth in general.
{quote}
I don't know whether it is the optimal way, but I agree with you that 
broadcasting the user and item features might cause less network traffic than 
sharing a global file.
However, I have to mention that you combine the user ("u") and item ("i") 
features with the user/item ratings (preferences "p") in one input file. If 
the user and item features were not in the same input file, we could partition 
the input by the user or item id of the preference (see the sketch below). 
These are just some thoughts for further improvements.
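
Just to sketch what I mean by partitioning on the preference's user id (a 
minimal, hypothetical example; the class name, the "userId,itemId,rating" 
input layout and the use of Hama's o.a.h.bsp.Partitioner interface are my 
assumptions, not existing code):

{code:java}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.Partitioner;

/**
 * Hypothetical sketch: route a preference line "userId,itemId,rating"
 * to the peer that owns this user, so the user's ratings and its
 * feature vector end up on the same task without broadcasting.
 */
public class PreferenceByUserPartitioner implements Partitioner<LongWritable, Text> {

  @Override
  public int getPartition(LongWritable key, Text value, int numTasks) {
    // value is assumed to hold "userId,itemId,rating"
    String[] fields = value.toString().split(",");
    long userId = Long.parseLong(fields[0].trim());
    // non-negative hash of the user id selects the owning peer
    return (int) ((userId & Long.MAX_VALUE) % numTasks);
  }
}
{code}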

{quote}
2) Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in 
UserSimilarity \[3] and ItemSimilarity \[4]
That is probably a matter of taste, not a big deal, but I guess at the time I 
thought KeyValuePair should be used for things that represent a key/value 
relation, usually hash- or search-related things. Sure, userId/itemId can also 
be considered a key, but Pair treats the two elements as equals in meaning: if 
you want the userId you get it from pair.first, if you want the score you get 
it from pair.second.
It is not a big deal and could be changed easily.
{quote}
I would only suggest using our own *o.a.h.c.u.KeyValuePair* class instead of 
the *commons.math3.util* package.
It is just a suggestion to remove the external dependency; if you prefer the 
terminology of a Pair, then there is no problem.
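
For illustration, a minimal sketch of how a similarity result could be 
returned with our own class instead of commons.math3.util.Pair (assuming 
KeyValuePair exposes getKey()/getValue(); the helper method and the values 
below are only hypothetical placeholders, not the UserSimilarity API):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hama.commons.util.KeyValuePair;

public class SimilarityExample {

  // Hypothetical helper: returns (userId, similarityScore) pairs
  // using o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair.
  public static List<KeyValuePair<Long, Double>> mostSimilarUsers() {
    List<KeyValuePair<Long, Double>> result =
        new ArrayList<KeyValuePair<Long, Double>>();
    result.add(new KeyValuePair<Long, Double>(42L, 0.87));

    // access via getKey()/getValue() instead of pair.first/pair.second
    for (KeyValuePair<Long, Double> pair : result) {
      System.out.println("user " + pair.getKey() + " score " + pair.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    mostSimilarUsers();
  }
}
{code}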

{quote}
3) No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
That's true, I agree.
4) Why are values sent to itself and received later? \[6] (possible performance 
increase?)
Yes, a possible performance increase. I agree.
{quote}
Based on your implementation, I built a simpler and smaller one without user 
and item feature support.
You can have a look at \[1] for possible improvements within the 
*normalizeWithBroadcastingValues* method.
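
Just to illustrate points 3) and 4) in one place (a minimal sketch, not the 
code from [1]; the DoubleWritable message and the aggregation logic are 
simplified stand-ins): the normalization exchange can be skipped entirely when 
only one peer is running, and a peer can keep its own contribution locally 
instead of sending a message to itself.

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class NormalizeSketch {

  public static double normalize(
      BSPPeer<?, ?, ?, ?, DoubleWritable> peer, double localSum)
      throws IOException, SyncException, InterruptedException {

    // Point 3): with a single task there is nothing to exchange.
    if (peer.getNumPeers() == 1) {
      return localSum;
    }

    // Point 4): start from the local value instead of sending it to ourselves.
    double globalSum = localSum;

    // Send the local contribution to all *other* peers only.
    for (String other : peer.getAllPeerNames()) {
      if (!other.equals(peer.getPeerName())) {
        peer.send(other, new DoubleWritable(localSum));
      }
    }
    peer.sync();

    // Aggregate the contributions received from the other peers.
    DoubleWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      globalSum += msg.get();
    }
    return globalSum;
  }
}
{code}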

{quote}
6) Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank > 
250) -> Infinite / NaN)
Lack of ML knowledge on my side. I just picked a value which worked for me, 
which is bad for sure.
{quote}
An easy solution would be to make this TETTA / ALPHA constant configurable.
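
For example, a minimal sketch of making the learning rate configurable (the 
property name *online.cf.alpha* and the default value are only placeholders I 
made up, not an existing Hama option):

{code:java}
import org.apache.hama.HamaConfiguration;

public class AlphaConfigSketch {

  // Hypothetical property name; not an existing Hama configuration key.
  public static final String ALPHA_KEY = "online.cf.alpha";
  public static final double DEFAULT_ALPHA = 0.001;

  // Read the learning rate from the job configuration, falling back
  // to a small default that stays stable for larger matrix ranks.
  public static double getAlpha(HamaConfiguration conf) {
    return Double.parseDouble(conf.get(ALPHA_KEY, String.valueOf(DEFAULT_ALPHA)));
  }

  public static void main(String[] args) {
    HamaConfiguration conf = new HamaConfiguration();
    conf.set(ALPHA_KEY, "0.0005"); // user overrides the default if needed
    System.out.println("alpha = " + getAlpha(conf));
  }
}
{code}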

Thanks for your time!

\[1] 
https://github.com/millecker/applications/blob/master/hama/hybrid/onlinecf/src/at/illecker/hama/hybrid/examples/onlinecf/OnlineCFTrainHybridBSP.java#L369-456

> Fix Collaborative Filtering Example
> -----------------------------------
>
>                 Key: HAMA-904
>                 URL: https://issues.apache.org/jira/browse/HAMA-904
>             Project: Hama
>          Issue Type: Bug
>          Components: examples, machine learning
>    Affects Versions: 0.6.4
>            Reporter: Martin Illecker
>            Priority: Minor
>              Labels: collaborative-filtering, examples, machine_learning
>             Fix For: 0.7.0
>
>
> *Fix Collaborative Filtering Example and revise test case.*
> I had a deep look into the collaborative filtering example of Ikhtiyor 
> Ahmedov \[1] and found the following questions / problems:
>  - Why are user and item features broadcasted \[2] by each peer and not 
> shared by a global input file on HDFS? (possible performance increase?)
>  - Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in 
> UserSimilarity \[3] and ItemSimilarity \[4]
>  - No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
>  - Why are values sent to itself and received later? \[6] (possible 
> performance increase?)
>  - Why is every task saving all items? \[7] (duplicate saves?)
>  - Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank > 
> 250) -> Infinite / NaN)
> I hope Ikhtiyor Ahmedov will finally become a committer and help us to solve 
> these questions.
> Thanks!
> \[1] https://issues.apache.org/jira/browse/HAMA-612
> \[2] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L116-128
> \[3] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/UserSimilarity.java#L22
> \[4] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/ItemSimilarity.java#L22
> \[5] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L138
> \[6] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L323
> \[7] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L387-422
> \[8] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/function/MeanAbsError.java#L62


