Re: new to hadoop

Sean Owen Tue, 04 May 2010 15:30:08 -0700

I could be cheeky and point you to the book...
http://manning.com/owen


But I can also give you an overview, which is kind of what you see
from surveying the code.

RecommenderJob runs everything. It kicks off five MapReduce jobs, in order.

1. ItemIDIndexMapper / ItemIDIndexReducer
Since item IDs are longs, and vector indices are ints, we have to hash
the longs to ints, but also remember the reverse mapping for later.
That's all this does: write down the mapping.
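
To make step 1 concrete, here is a minimal sketch in plain Python (not
Mahout's code). The names id_to_index and build_index are made up for
illustration, and the fold-and-mask hash is an assumption; the real job
may hash differently. Two IDs can collide on one index; this sketch
simply keeps the first ID seen.

```python
def id_to_index(item_id: int) -> int:
    """Fold a 64-bit item ID into a non-negative 31-bit vector index.

    Assumed hash for illustration only: XOR the high and low 32 bits,
    then mask off the sign bit.
    """
    item_id &= 0xFFFFFFFFFFFFFFFF          # treat the ID as an unsigned 64-bit value
    return (item_id ^ (item_id >> 32)) & 0x7FFFFFFF


def build_index(item_ids):
    """Record the reverse mapping index -> item ID for later lookup."""
    index_to_id = {}
    for item_id in item_ids:
        # On a collision, keep the first ID seen (a choice made for this sketch).
        index_to_id.setdefault(id_to_index(item_id), item_id)
    return index_to_id
```

The reverse mapping is what the last step needs in order to turn vector
indices back into item IDs.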

2. ToItemPrefsMapper / ToUserVectorReducer
This converts the file of preferences into proper Vectors. Here, there
is one vector per user, and item IDs (hashed) are dimensions and
preference values are dimension values.
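
As a rough picture of step 2, the sketch below groups preference
records by user into sparse vectors. It is plain Python, not the Mahout
classes; to_user_vectors and the (user_id, item_index, value) record
layout are assumptions for illustration.

```python
from collections import defaultdict


def to_user_vectors(prefs):
    """Group (user_id, item_index, value) records into one sparse vector per user.

    Each vector is a dict: hashed item index -> preference value.
    """
    vectors = defaultdict(dict)
    for user_id, item_index, value in prefs:
        vectors[user_id][item_index] = value
    return dict(vectors)
```

In MapReduce terms, the mapper would emit each record keyed by user and
the reducer would assemble the vector; here both happen in one loop.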

3. UserVectorToCooccurrenceMapper / UserVectorToCooccurrenceReducer
This is a somewhat complex step that does one thing: it counts
co-occurrence, that is, the number of times item A and item B appear
together in one user's preferences.
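
A minimal sketch of the counting in step 3, in plain Python rather than
as a mapper and reducer. The function name cooccurrence is made up, and
counting only pairs of distinct items (no diagonal) is an assumption of
this sketch, not a statement about what the job does.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(user_vectors):
    """Count how often each ordered pair of distinct items shares a user.

    user_vectors: dict of user -> {item_index: preference value}.
    Returns a Counter keyed by (item_a, item_b).
    """
    counts = Counter()
    for vector in user_vectors.values():
        # Every ordered pair of distinct items in one user's vector co-occurs once.
        for a, b in permutations(vector.keys(), 2):
            counts[(a, b)] += 1
    return counts
```

Because both (A, B) and (B, A) are counted, the resulting matrix is
symmetric, and one column of it can be read off by fixing the first
element of the pair.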

4. CooccurrenceColumnWrapperMapper + UserVectorSplitterMapper /
PartialMultiplyReducer
This has two mappers which output one item's cooccurrences (one column
of the co-occurrence matrix), and all user preferences for that item,
in a clever way. The reducer multiplies those preference values by the
co-occurrence column, and outputs the result vectors, keyed by user.
Each result is one part of the final recommendation vector for one user.
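
The multiplication in step 4 can be sketched like this, again in plain
Python and not as Mahout's implementation. partial_products and the two
dict layouts are assumptions for illustration; the way the two mappers
actually join their outputs is not shown.

```python
def partial_products(columns, prefs_by_item):
    """Yield (user, partial_vector) pairs, one per (item, user) preference.

    columns:       item -> {other_item: co-occurrence count}
    prefs_by_item: item -> {user: preference value for that item}
    """
    for item, column in columns.items():
        for user, value in prefs_by_item.get(item, {}).items():
            # Scale the item's co-occurrence column by the user's preference for it.
            yield user, {other: count * value for other, count in column.items()}
```

Each yielded vector is keyed by user, so everything belonging to one
user can be collected together in the next step.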

5. (IdentityMapper) / AggregateAndRecommendReducer
This adds up the partial vectors to make the final recommendation
vector for each user. The highest values are the recommended items.
The item index is mapped back to item ID and recommendations are
output.
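
A sketch of step 5 under the same caveats: plain Python, hypothetical
names (recommend, top_n), and no filtering of any kind. It sums the
partial vectors per user, takes the highest-scoring indices, and maps
them back to item IDs using the index written down in step 1.

```python
import heapq
from collections import defaultdict


def recommend(partials, index_to_id, top_n=10):
    """Sum partial vectors per user and return the top-scoring item IDs.

    partials:    iterable of (user, {item_index: partial score})
    index_to_id: item_index -> original item ID
    """
    totals = defaultdict(lambda: defaultdict(float))
    for user, vector in partials:
        for item_index, score in vector.items():
            totals[user][item_index] += score

    result = {}
    for user, scores in totals.items():
        # Highest summed values are the recommended items.
        best = heapq.nlargest(top_n, scores.items(), key=lambda kv: kv[1])
        result[user] = [(index_to_id[i], s) for i, s in best]
    return result
```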


That's it at a very high level; we can discuss more as you look at the code.

Sean