[
https://issues.apache.org/jira/browse/HAMA-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003167#comment-14003167
]
Martin Illecker commented on HAMA-904:
--------------------------------------
Thanks for your fast response!
{quote}
1) Why are user and item features broadcasted \[2] by each peer and not shared
by a global input file on HDFS? (possible performance increase?)
- Broadcasting happens only once, when the BSP job starts (so probably not a
huge performance increase, even if we fix it somehow).
- We don't have multiple input formats, that's why I needed to do the
partitioning in a BSP superstep.
- Let's say we have 10 GB of user and item features (maybe not realistic) and
10 peers. If we use a global file, every peer will probably
try to get all features (if I understand your statement
"global input file on HDFS" correctly), which means
a) 10 GB * 10 network traffic
b) filtering logic over 10 GB of data in each peer
c) memory overhead
And I don't know an exact way to get only the user/item features we are
interested in.
I did it in a superstep because the 10 GB will be split by the
partitioner and sent to the peers (10 GB traffic in total), and then each peer
will ask for the features it is interested in, which I guess needs less
bandwidth in general.
{quote}
I don't know if it is the optimal way but I agree with you that broadcasting
user and item features might cause less network traffic than sharing a global
file.
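To put rough numbers on the two options, here is a back-of-the-envelope sketch. The class and method names are made up for illustration and are not part of Hama. It only models the bulk transfer of the features and ignores the follow-up feature requests between peers as well as HDFS data locality.

```java
/** Back-of-the-envelope traffic estimate; hypothetical helper, not part of Hama. */
final class FeatureTrafficEstimate {

  /** Global file: every peer reads the complete feature file. */
  static long globalFileTraffic(long featureBytes, int numPeers) {
    return featureBytes * numPeers;
  }

  /** Partitioned input: each feature record is shipped to exactly one peer. */
  static long partitionedTraffic(long featureBytes) {
    return featureBytes;
  }

  public static void main(String[] args) {
    long tenGb = 10L * 1024 * 1024 * 1024; // the 10 GB from the example above
    int peers = 10;
    System.out.println("global file: " + globalFileTraffic(tenGb, peers) + " bytes");
    System.out.println("partitioned: " + partitionedTraffic(tenGb) + " bytes");
  }
}
```

The partitioned variant still pays for the later per-feature requests, so the real gap is smaller than the factor of 10 suggested here.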
But I think I have to mention that you combine the user ("u") and item ("i")
features with the user/item ratings (preferences "p") in one input file. If the
user and item features were not in the same input file, we could partition
the input by the user or item id of the preference. These are just some
thoughts for further improvements.
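A minimal sketch of what partitioning the preferences by user id could look like, assuming a simple hash-modulo scheme. The Preference class and all method names are hypothetical and only illustrate the idea; they are not taken from the Hama code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; class and method names are not part of Hama. */
final class PreferencePartitioning {

  /** One rating line ("p") of the input: user, item, rating. */
  static final class Preference {
    final long userId;
    final long itemId;
    final double rating;

    Preference(long userId, long itemId, double rating) {
      this.userId = userId;
      this.itemId = itemId;
      this.rating = rating;
    }
  }

  /** Maps a user id to a peer index in [0, numPeers). */
  static int peerFor(long userId, int numPeers) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(Long.hashCode(userId), numPeers);
  }

  /** Puts every preference into the bucket of the peer that owns its user. */
  static List<List<Preference>> partitionByUser(List<Preference> prefs, int numPeers) {
    List<List<Preference>> buckets = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      buckets.add(new ArrayList<>());
    }
    for (Preference p : prefs) {
      buckets.get(peerFor(p.userId, numPeers)).add(p);
    }
    return buckets;
  }
}
```

With this scheme all ratings of one user end up on the same peer. Partitioning by item id instead would only swap the field passed to peerFor.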
{quote}
2) Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in
UserSimilarity \[3] and ItemSimilarity \[4]
That is probably a matter of taste, not a big deal. I guess at that time I
thought KeyValuePair should be used
for things that represent a key/value, usually hash or search related things.
Sure, userId/itemId can also be considered a key,
but Pair treats both members as equals: if you want the userId, get it from
pair.first; if you want the score, get it from pair.second.
That's not a big deal and could be changed easily.
{quote}
I would only suggest using our own *o.a.h.c.u.KeyValuePair* class instead of
the one from the *commons.math3.util* package.
It is just a suggestion to remove the external dependency, but if you prefer
the terminology of a Pair, that is no problem.
{quote}
3) No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
That's true. Agree.
4) Why are values sent to itself and received later? \[6] (possible performance
increase?)
Yes, possible performance increase. Agree
{quote}
Based on your implementation, I built a simpler and smaller one without user
and item feature support.
You can have a look at \[1] for possible improvements within the
*normalizeWithBroadcastingValues* method.
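The two points (skip the normalization for a single task, and do not send values to yourself) can be sketched in isolation as follows. This is a simplified sketch, assuming the normalization boils down to averaging a value over all peers that hold it; the class and method names are hypothetical, and the real method works on feature vectors and BSP messages rather than plain doubles.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical sketch; not the code from OnlineTrainBSP. */
final class NormalizeSketch {

  /** Peers that need this peer's values: everyone except the peer itself. */
  static List<String> remotePeers(String self, String[] allPeers) {
    List<String> targets = new ArrayList<>();
    for (String peer : allPeers) {
      if (!peer.equals(self)) {
        targets.add(peer); // the local value stays local, no self-send
      }
    }
    return targets;
  }

  /** Averages the local value with the values received from other peers. */
  static double average(double local, List<Double> received) {
    if (received.isEmpty()) {
      return local; // single task: nothing to normalize
    }
    double sum = local;
    for (double value : received) {
      sum += value;
    }
    return sum / (received.size() + 1);
  }
}
```

With only one task, remotePeers returns an empty list, so no message is sent and the whole step can be skipped.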
{quote}
6) Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank >
250) -> Infinite / NaN)
Lack of ML knowledge on my side. I just picked a value that worked for me,
which is bad for sure.
{quote}
An easy solution would be to make this TETTA / ALPHA constant configurable.
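A minimal sketch of such a configurable value, using java.util.Properties as a stand-in for the job configuration. The property key "onlinecf.alpha" and the class name are assumptions made up for this example; only the default of 0.01 comes from the current code.

```java
import java.util.Properties;

/** Hypothetical sketch of a configurable learning rate; not part of Hama. */
final class LearningRateConfig {

  /** Assumed property key, chosen for this example only. */
  static final String ALPHA_KEY = "onlinecf.alpha";

  /** Current default mentioned in the issue. */
  static final double DEFAULT_ALPHA = 0.01;

  /** Returns the configured ALPHA, or the default if the key is not set. */
  static double alpha(Properties conf) {
    String raw = conf.getProperty(ALPHA_KEY);
    if (raw == null) {
      return DEFAULT_ALPHA;
    }
    double value = Double.parseDouble(raw.trim());
    // Rejects zero, negative values, NaN and infinity.
    if (!(value > 0.0) || Double.isInfinite(value)) {
      throw new IllegalArgumentException(ALPHA_KEY + " must be a positive finite number: " + raw);
    }
    return value;
  }
}
```

A user with a large matrixRank could then set the key to 0.001 without touching the code.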
Thanks for your time!
\[1]
https://github.com/millecker/applications/blob/master/hama/hybrid/onlinecf/src/at/illecker/hama/hybrid/examples/onlinecf/OnlineCFTrainHybridBSP.java#L369-456
> Fix Collaborative Filtering Example
> -----------------------------------
>
> Key: HAMA-904
> URL: https://issues.apache.org/jira/browse/HAMA-904
> Project: Hama
> Issue Type: Bug
> Components: examples, machine learning
> Affects Versions: 0.6.4
> Reporter: Martin Illecker
> Priority: Minor
> Labels: collaborative-filtering, examples, machine_learning
> Fix For: 0.7.0
>
>
> *Fix Collaborative Filtering Example and revise test case.*
> I had a deep look into the collaborative filtering example of Ikhtiyor
> Ahmedov \[1] and found the following questions / problems:
> - Why are user and item features broadcasted \[2] by each peer and not
> shared by a global input file on HDFS? (possible performance increase?)
> - Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in
> UserSimilarity \[3] and ItemSimilarity \[4]
> - No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
> - Why are values sent to itself and received later? \[6] (possible
> performance increase?)
> - Why is every task saving all items? \[7] (duplicate saves?)
> - Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank >
> 250) -> Infinite / NaN)
> I hope Ikhtiyor Ahmedov will finally become a committer and help us solve
> these questions.
> Thanks!
> \[1] https://issues.apache.org/jira/browse/HAMA-612
> \[2]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L116-128
> \[3]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/UserSimilarity.java#L22
> \[4]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/ItemSimilarity.java#L22
> \[5]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L138
> \[6]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L323
> \[7]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L387-422
> \[8]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/function/MeanAbsError.java#L62