Re: How do I perform a scalable cartesian product

Christoph Schmitz Thu, 15 Aug 2013 04:02:51 -0700

Hi Steve,

if the only problem is that the size of your zipcode squared (moreaccurately, n * (n-1) if you order the pairs of persons, assuming thatdistance is symmetric) is too large, it might help to split the zipcodeinto buckets by some hash function and partition the all-pairscomputations over buckets.

That is, if you have 10 buckets containing the users of a zipcode andits neighboring zipcodes, first compute all pairs of persons withinbucket 1, then all pairs of persons in bucket 1 and 2, then 1-3, 1-4etc. up to 10-10. Obviously, you don't need to do 4-1 if you've alreadydone 1-4 (symmetry, see above), so you'll end up doing n * (n+1) pairsof buckets (55 in this case).


Basically, this means creating artificial, smaller zipcodes.

Apart from that, I'd like to point out that there's been a lot ofresearch on nearest-neighbor search; perhaps some state-of-the-artalgorithm will be applicable to your problem.


http://en.wikipedia.org/wiki/Nearest_neighbor_search

Hope this helps,

Christoph


On 14.08.2013 19:06, Steve Lewis wrote:

   I have the problem of performing a operation of a data set on itself.

    Assume, for example, that I have a list of people and their
addresses and for each person I want the ten closest members of the set.
(this is not the problem but illustrated critical aspects). I know that
the ten closest people will be in the same zipcode or a neighboring zip
code. This means unless the database is very large I can have the mapper
send every person out with keys representing  their zipcode and also
keys representing the neighboring zip codes. In the reducer I can keep
all people in memory and compute distances between them (assume the
distance computation is slightly expensive).
    The problem is that this approach will not scale - eventually the
number of people assigned to a zip code will exceed memory. In the
current problem the number of "people" is about 100 million and doubling
every 6 months. The size of a "zipcode" requires keeping about 100,000
items in memory - doable today but marginal in terms of future growth.
    Are there other ways to solve the problem. I considered keeping a
random subset, finding the closest in that subset and then repeating
with different random subsets. The solution of midifying the splitter to
generate all pairs
https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
 will
not work for a dataset with 100 million items
    Any bright ideas?


--
Christoph Schmitz
Software-Architekt
Targeting Core Product

1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-6733
E-Mail: christoph.schm...@1und1.de | Web: www.1und1.de

Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 6484

Vorstand: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, AndreasHofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek, Jan Oetjen,Christian Würst

Aufsichtsratsvorsitzender: Michael Scheeren

Member of United Internet

Diese E-Mail kann vertrauliche und/oder gesetzlich geschützteInformationen enthalten. Wenn Sie nicht der bestimmungsgemäße Adressatsind oder diese E-Mail irrtümlich erhalten haben, unterrichten Sie bitteden Absender und vernichten Sie diese Email. Anderen als dembestimmungsgemäßen Adressaten ist untersagt, diese E-Mail zu speichern,weiterzuleiten oder ihren Inhalt auf welche Weise auch immer zu verwenden.

This E-Mail may contain confidential and/or privileged information. Ifyou are not the intended recipient of this E-Mail, you are herebynotified that saving, distribution or use of the content of this E-Mailin any way is prohibited. If you have received this E-Mail in error,please notify the sender and delete the E-Mail.

Re: How do I perform a scalable cartesian product

Reply via email to