Hi Sean,

> Really, I have never run this code in a real Hadoop environment. There
> could be bugs, or improvements, that fall out from that. For example
> there might be some more efficient way to use Hadoop that I don't see.
> I don't have anything specific in mind -- these are unknown-unknowns
> to me. But I think this could form part of a decent project.

Okay. I won't comment on this until I have gotten to know Slope One.

> This would be a fantastic project, implementing a Recommender based on
> this approach. I tried implementing an SVD technique a couple years
> ago and it was waaay too slow on one machine. Revisiting with Hadoop
> sounds great.

Glad that you are so positive about this. I just googled and found an
article on parallel SVD [1], which was devised at Google. I shall
spend some time reading it. If we really are going to do this project,
implementing only the SVD part would, in my opinion, be good enough. We
can leave the implementation of the algorithms that rely on SVD for
later work.

> It's interesting, and I personally find this a worthy project too. On
> my list of priorities, I don't find a Recommender that prioritizes
> privacy or minimizing information sharing as compelling. In most
> real-world cases where exposing preference data might be a concern, I
> think it can be solved by just using opaque user/item IDs or
> something. But, I wouldn't object if someone thought they could
> implement this usefully.

Privacy was not my concern. I was asking whether we can draw some
inspiration from the idea that the CF process can be distributed across
multiple nodes, though unfortunately I haven't got a clue :(

[1] Gengxin Miao, Yangqiu Song, Dong Zhang, and Hongjie Bai. Parallel
    Spectral Clustering Algorithm for Large-Scale Community Data Mining.
    2008. http://yqsong.googlepages.com/swsm08_submission_16.pdf

--
Yin Qiu
