So I ran my process again with some more logging and here is what I see. 

I used my own HashPartitioner (basically I copied Hadoop's partitioner and 
added some logging for analysis). It prints the key and the reducer that is 
assigned to that key (based on its hash code). 

My process spawned 2 mappers (running on 2 different Hadoop machines), so 
both of them ask the partitioner for reducers for the split files assigned to 
them. I see that for the same object key, given to both of these mappers, the 
Partitioner allocates 2 different reducers.
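Roughly, what my partitioner does is Hadoop's one-liner with a print added. 
A standalone sketch (no Hadoop dependency; the class name and log format here 
are just illustrative, not my actual code):

```java
// Standalone sketch of the logging partitioner described above.
// The formula matches Hadoop's HashPartitioner: clear the sign bit
// of hashCode(), then take it modulo the number of reduce tasks.
public class LoggingHashPartitioner {
    public static int getPartition(Object key, int numReduceTasks) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        System.out.println("key=" + key + " hash=" + key.hashCode()
                + " -> reducer " + partition);
        return partition;
    }
}
```

Given a stable hashCode(), this should return the same reducer for the same 
key no matter which mapper calls it.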

In the reducers I see:

1) The 2 different reducers (the ones the partitioner assigned the key to) 
print out the same key (I did not print the values, as I thought that 
wouldn't matter).
2) Here are the logs from where the reducers copy data from the mappers:
        
Reducer1:

2010-05-25 11:34:49,810 INFO org.apache.hadoop.mapred.ReduceTask: Read 1002612 
bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,831 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
attempt_201005251129_0001_m_000001_0 -> (127, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,797 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005251129_0001_r_000006_0: Got 1 new map-outputs
2010-05-25 11:34:54,835 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005251129_0001_r_000006_0 Scheduled 1 outputs (0 slow hosts and0 dup 
hosts)
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005251129_0001_m_000000_0, compressed len: 1553902, decompressed 
len: 1553898
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
1553898 bytes (1553902 raw bytes) into RAM from 
attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,924 INFO org.apache.hadoop.mapred.ReduceTask: Read 1553898 
bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,944 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
attempt_201005251129_0001_m_000000_0 -> (143, 36) from hadoop-25.c.a.com


Reducer2: 
 
2010-05-25 11:34:49,822 INFO org.apache.hadoop.mapred.ReduceTask: Read 637657 
bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,911 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
attempt_201005251129_0001_m_000001_0 -> (125, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,806 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005251129_0001_r_000008_0: Got 1 new map-outputs
2010-05-25 11:34:54,915 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005251129_0001_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup 
hosts)
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005251129_0001_m_000000_0, compressed len: 1462335, decompressed 
len: 1462331
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
1462331 bytes (1462335 raw bytes) into RAM from 
attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Read 1462331 
bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
attempt_201005251129_0001_m_000000_0 -> (147, 36) from hadoop-25.c.a.com


The 2 reduce tasks have different task IDs and belong to the same job.

Thanks,
Deepika

-----Original Message-----
From: Eric Sammer [mailto:[email protected]] 
Sent: Tuesday, May 25, 2010 8:10 AM
To: [email protected]
Subject: Re: Re: Hash Partitioner

On Mon, May 24, 2010 at 6:32 PM, Deepika Khera <[email protected]> wrote:
> Thanks for your response Eric.
>
> I am using hadoop 0.20.2.
>
> Here is what the hashCode() implementation looks like (I actually had the IDE 
> generate it for me)
>
> Main key (for mapper & reducer):
>
> public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (aKey != null ? aKey.hashCode() : 0);
>        result = 31 * result + (gKey != null ? gKey.hashCode() : 0);
>        result = 31 * result + (int) (date ^ (date >>> 32));
>        result = 31 * result + (ma != null ? ma.hashCode() : 0);
>        result = 31 * result + (cl != null ? cl.hashCode() : 0);
>        return result;
>    }
>
>
> aKey : AKey class
>
>
>    public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (v != null ? v.hashCode() : 0);
>        result = 31 * result + (s != null ? s.hashCode() : 0);
>        result = 31 * result + (o != null ? o.hashCode() : 0);
>        result = 31 * result + (l != null ? l.hashCode() : 0);
>        result = 31 * result + (e ? 1 : 0); //boolean
>        result = 31 * result + (li ? 1 : 0); //boolean
>        result = 31 * result + (aut ? 1 : 0); //boolean
>        return result;
>    }
>

Both of these look fine, assuming all the other hashCode()s return the
same value every time.

> When this happens, I do see the same values for the key. Also I am not using 
> a grouping comparator.

So you see two reduce methods getting the same key with the same
values? That's extremely odd. If this is the case, there's a bug in
Hadoop. Can you find the relevant logs from the reducers where Hadoop
fetches the map output? Does it look like it's fetching the same output
twice? Do the two tasks where you see the duplicates have the same
task ID? Can you confirm for us that the reduce tasks are from the same
job ID?

> I was wondering: since the call to HashPartitioner.getPartition() is made from 
> a map task, several of which are running on different machines, is it 
> possible that they compute different hash codes and hence get different 
> reducers assigned, even when the key is the same?

The hashCode() result should *always* be the same given the same
internal state. In other words, it should be consistent and stable. If
I create a string with new String("hello world"), it will always have
the exact same hashCode(). If that isn't true, you will get wildly
unpredictable results, not just with Hadoop but with Java's
comparators, collections, etc.
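A quick way to convince yourself (illustrative only; I've borrowed a few 
field names from your key class) is to rebuild the hash from the same field 
values twice and compare:

```java
// Illustrative check: a hashCode() computed only from an object's fields
// (as in the IDE-generated code above) is a pure function of that state,
// so equal field values must always produce equal hashes -- and hence
// the same partition, no matter which mapper computes it.
public class HashStabilityCheck {
    public static int hashOf(int version, String aKey, long date) {
        int result = version;
        result = 31 * result + (aKey != null ? aKey.hashCode() : 0);
        result = 31 * result + (int) (date ^ (date >>> 32));
        return result;
    }

    public static void main(String[] args) {
        int h1 = hashOf(1, "abc", 1274805600000L);
        int h2 = hashOf(1, "abc", 1274805600000L);
        System.out.println(h1 == h2); // prints true
    }
}
```

If a check like this ever failed, it would mean some field's own hashCode() 
is unstable (e.g. it falls back to Object's identity hash), which is where 
I'd look first.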

-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
