Re: Need help to map a problem to Mapreduce domain

Vinayak Borkar Thu, 29 Mar 2012 13:23:24 -0700

Hi Praveen,

As stated, your problem requires in the worst case that all cities appear at every reducer. The simplest way to achieve that is to use one reducer, but this is a sequential solution and probably not what you are looking for.
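
To make the single-reducer idea concrete, here is a minimal sketch in plain Python that simulates the map and reduce steps locally (it does not use the Hadoop API). The `similarity` function, the constant key "all", and the 0.5 threshold are placeholders I made up for illustration; your own utility method would replace `similarity`.

```python
from itertools import combinations

def similarity(a, b):
    # Placeholder for the real utility method (assumption):
    # cities with the same parity in their number count as similar.
    return 1.0 if int(a[1:]) % 2 == int(b[1:]) % 2 else 0.0

def mapper(split):
    # Emit every city under one constant key so all of them
    # end up at the same (single) reducer.
    for city in split:
        yield ("all", city)

def reducer(key, cities, threshold=0.5):
    # The reducer sees every city, removes duplicates, and
    # compares all pairs sequentially.
    unique = sorted(set(cities))
    similar = {}
    for a, b in combinations(unique, 2):
        if similarity(a, b) >= threshold:
            similar.setdefault(a, []).append(b)
            similar.setdefault(b, []).append(a)
    return similar

splits = [["C1", "C2", "C3", "C4"], ["C1", "C2", "C5", "C6"]]
shuffled = [city for split in splits for _, city in mapper(split)]
similar = reducer("all", shuffled)
print(similar["C3"])
```

The all-pairs loop in the reducer is where the sequential cost shows up: it grows quadratically with the number of distinct cities.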

If you have more visibility into your similarity function, you can do better. Look at http://asterix.ics.uci.edu/pub/sigmod10-vernica-long.pdf for an approach to a similar problem, set similarity joins.

One other approach you could use (if the number of unique cities is fairly small) is to first run a MapReduce job to compute the distinct cities (duplicates eliminated). Then run a map-only job in which each mapper uses the distinct list of cities to perform the "similarity join" with the data in its HDFS block.
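
A rough sketch of this two-job approach, again simulated locally in plain Python rather than written against the Hadoop API. The `similarity` function and the 0.5 threshold are stand-ins (assumptions) for your utility method, and the function names are only illustrative.

```python
def similarity(a, b):
    # Placeholder for the real utility method (assumption):
    # cities with the same parity in their number count as similar.
    return 1.0 if int(a[1:]) % 2 == int(b[1:]) % 2 else 0.0

def distinct_cities(splits):
    # Job 1: conceptually, map emits (city, null) and reduce
    # keeps one copy per key, which eliminates duplicates.
    seen = set()
    for split in splits:
        seen.update(split)
    return sorted(seen)

def map_only_join(split, all_cities, threshold=0.5):
    # Job 2 (map-only): each mapper compares the cities in its
    # own block against the full distinct list, so no reducer
    # is needed to bring matching cities together.
    out = {}
    for city in split:
        out[city] = [c for c in all_cities
                     if c != city and similarity(city, c) >= threshold]
    return out

splits = [["C1", "C2", "C3", "C4"], ["C1", "C2", "C5", "C6"]]
cities = distinct_cities(splits)
results = [map_only_join(split, cities) for split in splits]
print(results[0]["C3"])
```

Here C3 is matched with C5 even though they never share a split, because every mapper holds the whole distinct list. In a real job that list would have to be made available to each mapper (for example through the distributed cache), which is why this only works when the list is small enough to fit in memory.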


Hope this helps.

Vinayak


On 3/29/12 1:05 PM, Praveen Kumar K J V S wrote:

Hi All,

I have already posted my question to the MapReduce users mailing list, but
alas I did not get any response. Probably I did not convey my question
clearly, so I thought I would rephrase it and post it on the dev list.

Kindly give your suggestions.

I have many files in HDFS, each containing a list of cities. For each city in
any document, I want to find the similar cities that appear in any of the
documents. I have a utility method that gives the level of similarity between
two cities, returning a value between 0 and 1.

Is there a way of doing this in Hadoop? I have a specific doubt, because a
city might be similar to another city present in some other input split
that is processed by another mapper. Let's say the odd cities (C1, C3, C5)
are similar.

Input Split 1 has the cities: C1, C2, C3, C4
Input Split 2 has the cities: C1, C2, C5, C6

Say my mapper 1 output is as follows, since odd cities (C1, C3, C5) are similar:

C1, C3
C2, C4
C3, C1
C4, C2

Similarly for mapper 2:
C1, C5
C2, C6
C5, C1
C6, C2

Since C1 appears in both splits, at my reducer I finally get C3, C5
for key C1. But this does not happen for C3, since it appears in only one
split.

Thanks,
Praveen
