Hello list, as a newbie I have a tricky use case in mind that I want to implement with Hadoop to train my skills. There is no real scenario behind it, so I can extend or shrink the problem as much as I like.
I create random lists of person IDs and places, plus a time value. The result of my map-reduce operations should look like this: the key is a place, and the value is a list of the places that persons visited after they visited the key place. Additionally, the value list should be sorted by some time/count-biased metric, so that it reflects which place was the most popular, e.g. as the second station on a tour. I think this is a complex, almost real-world scenario.

In pseudo-code it will be something like this:

    for every place p
        for every person m that visited p
            select list l of all the places that m visited after p
            write a key-value pair p=>l to disk, where l is in order of the visits

    for every key k in the list of key-value pairs
        get the value list of places v for k
        for every place p in v
            create another key-value pair pv where the key is the place
            and the value is its index in v

    for every k
        get all pv
        for every pv
            aggregate the key-value pairs by key and sum up the index i
            for every place p, so that it becomes the kv-pair opv
        sort opv in ascending order by its value

The result would be what I wanted, no?

It looks like I need multiple MR phases; however, I do not even know how to start. My first guess is: create an MR job that inverts my list, so that I get a place as the key and, as the value, all persons that visited it. The next phase needs to iterate over the persons in the value and join with the original data to find out when each person visited this place and which places came next.

And now the problems arise:

- First: what happens to places that are so popular that the number of persons who visited them is too large to pass the whole KV pair to a single node to iterate over it?
- Second: I need to re-join the original data. Without a database this would be extremely slow, wouldn't it?

I hope that you guys can give me some ideas and input to make my first serious steps in Hadoop-land.

Regards,
Em
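
P.S. To pin down what result I expect, here is a minimal single-machine sketch of the pseudo-code in plain Python. It is not Hadoop code. The function name `next_places`, the `(person, place, time)` tuple layout and the alphabetical tie-break are just my assumptions for illustration. Note that the sketch groups the visits by person first instead of by place; I do not know whether that is the right way to structure it as MR jobs.

```python
from collections import defaultdict

def next_places(visits):
    """visits: iterable of (person_id, place, time) tuples (assumed layout)."""
    # Group all visits by person, keeping the time so they can be ordered.
    by_person = defaultdict(list)
    for person, place, time in visits:
        by_person[person].append((time, place))

    # index_sums[p][q] = sum over all persons of the index of q
    # in the list of places visited after p.
    index_sums = defaultdict(lambda: defaultdict(int))
    for stops in by_person.values():
        stops.sort()  # order each person's tour by time
        for i, (_, p) in enumerate(stops):
            for idx, (_, later) in enumerate(stops[i + 1:]):
                index_sums[p][later] += idx

    # Sort the follow-up places ascending by summed index
    # (ties broken alphabetically, which is an arbitrary choice).
    return {
        p: sorted(sums, key=lambda q: (sums[q], q))
        for p, sums in index_sums.items()
    }

if __name__ == "__main__":
    sample = [
        ("a", "X", 1), ("a", "Z", 2), ("a", "Y", 3),
        ("b", "X", 1), ("b", "Z", 2), ("b", "Y", 3),
        ("c", "X", 5), ("c", "Z", 6),
    ]
    print(next_places(sample))  # {'X': ['Z', 'Y'], 'Z': ['Y']}
```

One thing I noticed while writing this: because the indices are simply summed, a place that only a few persons visited gets a small sum and ranks high, so the metric probably needs some count-based weighting as well.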