I have the problem of performing an operation on a data set against itself.
Assume, for example, that I have a list of people and their addresses, and for each person I want to find the ten closest members of the set (this is not the actual problem, but it illustrates the critical aspects). I know that the ten closest people will be in the same zip code or a neighboring zip code. So, unless the data set is very large, I can have the mapper send every person out with a key for their own zip code and keys for the neighboring zip codes. In the reducer I can keep all the people for a zip code in memory and compute the distances between them (assume the distance computation is somewhat expensive).

The problem is that this approach will not scale: eventually the number of people assigned to a zip code will exceed memory. In the actual problem the number of "people" is about 100 million and is doubling every 6 months, and a "zipcode" requires keeping about 100,000 items in memory - doable today but marginal in terms of future growth.

Are there other ways to solve the problem? I considered keeping a random subset, finding the closest neighbors within that subset, and then repeating with different random subsets. The solution of modifying the splitter to generate all pairs (https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java) will not work for a data set with 100 million items.

Any bright ideas?
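For concreteness, here is a minimal sketch of the zip-code-blocking approach I described above (the one I'm trying to replace), using the Hadoop MapReduce API. The helpers `extractZip`, `neighboringZips`, and `distance` are hypothetical placeholders for my real record format, zip-adjacency lookup, and distance computation.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NearestNeighborsByZip {

    // Hypothetical: parse the zip code out of a serialized person record.
    static String extractZip(String record) {
        return record.split(",")[1]; // assumes "id,zip,lat,lon"
    }

    // Hypothetical: look up the zip codes adjacent to the given one.
    static List<String> neighboringZips(String zip) {
        return Collections.emptyList(); // placeholder for a real adjacency table
    }

    // Hypothetical: the (somewhat expensive) distance between two records.
    static double distance(String a, String b) {
        return 0.0; // placeholder
    }

    public static class ZipBlockMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String zip = extractZip(value.toString());
            context.write(new Text(zip), value);       // person's own zip code
            for (String n : neighboringZips(zip)) {
                context.write(new Text(n), value);     // each neighboring zip code
            }
        }
    }

    public static class NearestTenReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text zip, Iterable<Text> people, Context context)
                throws IOException, InterruptedException {
            // This buffering is the part that stops scaling: every record keyed
            // to this zip code (or one of its neighbors) is held in memory.
            List<String> buffered = new ArrayList<>();
            for (Text p : people) {
                buffered.add(p.toString()); // copy; Hadoop reuses the Text object
            }
            for (String a : buffered) {
                // For each person, scan the block and keep the ten smallest
                // distances; the top-10 bookkeeping and emit are omitted here.
                for (String b : buffered) {
                    if (!a.equals(b)) {
                        double d = distance(a, b);
                        // ... track the ten closest and emit them ...
                    }
                }
            }
        }
    }
}
```

The reducer's work per zip code is roughly quadratic in the block size, which is why the ~100,000-item blocks are marginal both for memory and for compute.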