Thanks, very kind of you.

On Fri, Mar 30, 2012 at 2:00 AM, Vinayak Borkar <[email protected]> wrote:

> Sorry about the broken link.
>
> Here is one that works.
>
> http://flamingo.ics.uci.edu/pub/sigmod10-vernica.pdf
>
> Vinayak
>
>
>
> On 3/29/12 1:27 PM, Praveen Kumar K J V S wrote:
>
>> Hi Vinayak,
>>
>> Thanks. If I use a single reducer I might run out of memory in that
>> reducer's JVM.
>>
>> BTW, the URL is not accessible.
>>
>> Thanks,
>> Praveen
>>
>> On Fri, Mar 30, 2012 at 1:52 AM, Vinayak Borkar<[email protected]>  wrote:
>>
>>> Hi Praveen,
>>>
>>> The way your problem is stated requires, in the worst case, that all
>>> cities appear at every reducer. The simplest way to do that is to have
>>> one reducer -- but that is a sequential solution and probably not what
>>> you are looking for.
>>>
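>>> For completeness, the single-reducer version would look roughly like the
>>> untested sketch below; run it with job.setNumReduceTasks(1). The 0.8
>>> threshold and the similarity() stub are placeholders for your own
>>> utility method.
>>>
>>>   import java.io.IOException;
>>>   import java.util.ArrayList;
>>>   import java.util.List;
>>>   import org.apache.hadoop.io.LongWritable;
>>>   import org.apache.hadoop.io.Text;
>>>   import org.apache.hadoop.mapreduce.Mapper;
>>>   import org.apache.hadoop.mapreduce.Reducer;
>>>
>>>   public class SingleReducerSimilarity {
>>>
>>>     // Send every city to the same reducer under one constant key.
>>>     public static class CityMapper
>>>         extends Mapper<LongWritable, Text, Text, Text> {
>>>       private static final Text ALL = new Text("all");
>>>
>>>       @Override
>>>       protected void map(LongWritable offset, Text city, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         ctx.write(ALL, city);   // one city per input line
>>>       }
>>>     }
>>>
>>>     // Buffers every city in memory, then compares all pairs -- which is
>>>     // exactly why this can blow up for large inputs.
>>>     public static class PairReducer
>>>         extends Reducer<Text, Text, Text, Text> {
>>>       @Override
>>>       protected void reduce(Text key, Iterable<Text> cities, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         List<String> all = new ArrayList<String>();
>>>         for (Text c : cities) {
>>>           all.add(c.toString());   // copy; Hadoop reuses the Text object
>>>         }
>>>         for (String a : all) {
>>>           for (String b : all) {
>>>             if (!a.equals(b) && similarity(a, b) > 0.8) {
>>>               ctx.write(new Text(a), new Text(b));
>>>             }
>>>           }
>>>         }
>>>       }
>>>
>>>       // Stand-in for your utility method returning a value in [0, 1].
>>>       private double similarity(String a, String b) {
>>>         return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
>>>       }
>>>     }
>>>   }
>>>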
>>> If you have more visibility into your similarity function you can do
>>> better. Look at http://asterix.ics.uci.edu/pub/sigmod10-vernica-long.pdf,
>>> which tries to solve a similar problem for set similarity joins.
>>>
>>>
>>> One other approach you could use (if the number of unique cities is
>>> fairly small) is to first run a MapReduce job to compute the distinct
>>> cities (duplicates eliminated). Then do a map-only job where each mapper
>>> uses the distinct list of cities to perform the "similarity join" with
>>> the data in its HDFS block, as in the sketch below.
>>>
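>>> A rough, untested sketch of that two-job idea follows. The class names,
>>> the "cities.txt" cache file name and the 0.8 threshold are just
>>> placeholders; the distinct city list from job 1 is assumed to be shipped
>>> to each mapper, e.g. via the distributed cache.
>>>
>>>   import java.io.BufferedReader;
>>>   import java.io.FileReader;
>>>   import java.io.IOException;
>>>   import java.util.ArrayList;
>>>   import java.util.List;
>>>   import org.apache.hadoop.io.LongWritable;
>>>   import org.apache.hadoop.io.NullWritable;
>>>   import org.apache.hadoop.io.Text;
>>>   import org.apache.hadoop.mapreduce.Mapper;
>>>   import org.apache.hadoop.mapreduce.Reducer;
>>>
>>>   public class TwoPhaseSimilarityJoin {
>>>
>>>     // Job 1: duplicate elimination -- each city becomes a key, and the
>>>     // reducer writes it exactly once.
>>>     public static class DistinctMapper
>>>         extends Mapper<LongWritable, Text, Text, NullWritable> {
>>>       @Override
>>>       protected void map(LongWritable offset, Text city, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         ctx.write(city, NullWritable.get());
>>>       }
>>>     }
>>>
>>>     public static class DistinctReducer
>>>         extends Reducer<Text, NullWritable, Text, NullWritable> {
>>>       @Override
>>>       protected void reduce(Text city, Iterable<NullWritable> ignored,
>>>           Context ctx) throws IOException, InterruptedException {
>>>         ctx.write(city, NullWritable.get());
>>>       }
>>>     }
>>>
>>>     // Job 2: map-only "similarity join" -- each mapper loads the distinct
>>>     // city list and compares the cities in its own split against it.
>>>     public static class JoinMapper
>>>         extends Mapper<LongWritable, Text, Text, Text> {
>>>       private final List<String> allCities = new ArrayList<String>();
>>>
>>>       @Override
>>>       protected void setup(Context ctx)
>>>           throws IOException, InterruptedException {
>>>         // Assumes the job 1 output was added to the distributed cache and
>>>         // symlinked into the task working directory as "cities.txt".
>>>         BufferedReader r = new BufferedReader(new FileReader("cities.txt"));
>>>         String line;
>>>         while ((line = r.readLine()) != null) {
>>>           allCities.add(line.trim());
>>>         }
>>>         r.close();
>>>       }
>>>
>>>       @Override
>>>       protected void map(LongWritable offset, Text value, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         String city = value.toString();
>>>         for (String other : allCities) {
>>>           if (!city.equals(other) && similarity(city, other) > 0.8) {
>>>             ctx.write(new Text(city), new Text(other));
>>>           }
>>>         }
>>>       }
>>>
>>>       // Stand-in for your utility method returning a value in [0, 1].
>>>       private double similarity(String a, String b) {
>>>         return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
>>>       }
>>>     }
>>>   }
>>>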
>>> Hope this helps.
>>>
>>> Vinayak
>>>
>>>
>>>
>>> On 3/29/12 1:05 PM, Praveen Kumar K J V S wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have already posted my question to the MapReduce users mailing list,
>>>> but alas I did not get any response. Probably I did not convey my
>>>> question correctly, so I thought I would rephrase it and post it to the
>>>> dev list.
>>>>
>>>> Kindly give your suggestions.
>>>>
>>>> I have many files in HDFS, each containing a list of cities. For each
>>>> city in any document I want to find the similar cities that appear in
>>>> any of the documents. I have a utility method that gives the level of
>>>> similarity between 2 cities, returning a value between 0 and 1.
>>>>
>>>> Is there a way of doing this in Hadoop? I have a specific doubt
>>>> because a city might be similar to another city present in some other
>>>> input split that is processed by another mapper. Let's say the odd
>>>> cities (C1, C3, C5) are similar:
>>>>
>>>> Input Split 1 has the cities: C1, C2, C3, C4
>>>> Input Split 2 has the cities: C1, C2, C5, C6
>>>>
>>>> Say my mapper 1 output is (since the odd cities C1, C3, C5 are similar):
>>>>
>>>> C1, C3,
>>>> C2, C4
>>>> C3, C1
>>>> C4, C2
>>>>
>>>> Similarly for mapper 2:
>>>> C1, C5
>>>> C2, C5
>>>> C5, C1
>>>> C6, C2
>>>>
>>>> Since C1 appears in both the splits, at my reducer I finally get C3, C5
>>>> for key C1. But this does not happen for C3, since it appears in only
>>>> one split.
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>>
>>>
>>
>
