This is just to close this one: I finally resolved my issue. The problem was that my key contained some enums, whose hash codes are not constant across JVMs (Enum.hashCode() is identity-based). So instead of calling myEnum.hashCode(), I should have first converted the enum to a string and taken that string's hashcode (i.e. myEnum.name().hashCode()). I was relying on the generated hashCode() method being correct, since the IDE had produced it for me; I just had to be careful in my case.
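For anyone hitting the same thing, here is a minimal sketch of the fix described above. Enum.hashCode() is inherited from Object and is identity-based, so it can differ between JVM runs; String.hashCode() is specified by the Java Language Specification and is the same everywhere. The enum and class names below are illustrative, not the original poster's actual classes.

```java
public class StableKeyHash {
    enum MyEnum { ALPHA, BETA }

    // Identity-based: may differ from one JVM run to the next.
    static int unstableHash(MyEnum e) {
        return e.hashCode();
    }

    // name() is the constant's declared name; String.hashCode() is
    // specified by the JLS, so this is stable across JVMs.
    static int stableHash(MyEnum e) {
        return e.name().hashCode();
    }

    public static void main(String[] args) {
        System.out.println(stableHash(MyEnum.ALPHA) == "ALPHA".hashCode()); // prints "true"
    }
}
```

In a distributed setting like this one, any field folded into a key's hashCode() needs the same cross-JVM stability, since different mappers compute the hash in different JVMs.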
Thanks for all the help!
Deepika

-----Original Message-----
From: Deepika Khera [mailto:[email protected]]
Sent: Tuesday, May 25, 2010 2:03 PM
To: [email protected]
Subject: RE: Re: Re: Hash Partitioner

So I ran my process again with some more logging, and here is what I see. I used my own HashPartitioner (basically hadoop's partitioner copied, with some logging added for analysis). I printed the key and the reducer assigned to that key (based on the hash code).

My process triggered 2 mappers (running on 2 different hadoop machines), so both of them look up reducers for the split file assigned to them. For the same object key, seen by both mappers, the Partitioner allocates 2 different reducers.

In the reducers I see:

1) 2 different reducers (the ones the partitioner assigned the key to) printing out the same key (I did not print the value, as I thought that wouldn't matter)

2) Here are the logs from where the reducers copy data from the mappers -

Reducer1:
2010-05-25 11:34:49,810 INFO org.apache.hadoop.mapred.ReduceTask: Read 1002612 bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,831 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000001_0 -> (127, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,797 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000006_0: Got 1 new map-outputs
2010-05-25 11:34:54,835 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000006_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201005251129_0001_m_000000_0, compressed len: 1553902, decompressed len: 1553898
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 1553898 bytes (1553902 raw bytes) into RAM from attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,924 INFO org.apache.hadoop.mapred.ReduceTask: Read 1553898 bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,944 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000000_0 -> (143, 36) from hadoop-25.c.a.com

Reducer2:
2010-05-25 11:34:49,822 INFO org.apache.hadoop.mapred.ReduceTask: Read 637657 bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,911 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000001_0 -> (125, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,806 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000008_0: Got 1 new map-outputs
2010-05-25 11:34:54,915 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000008_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201005251129_0001_m_000000_0, compressed len: 1462335, decompressed len: 1462331
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 1462331 bytes (1462335 raw bytes) into RAM from attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Read 1462331 bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000000_0 -> (147, 36) from hadoop-25.c.a.com

The 2 reduce tasks have different task ids and belong to the same job.

Thanks,
Deepika

-----Original Message-----
From: Eric Sammer [mailto:[email protected]]
Sent: Tuesday, May 25, 2010 8:10 AM
To: [email protected]
Subject: Re: Re: Hash Partitioner

On Mon, May 24, 2010 at 6:32 PM, Deepika Khera <[email protected]> wrote:
> Thanks for your response Eric.
>
> I am using hadoop 0.20.2.
>
> Here is what the hashCode() implementation looks like (I actually had the
> IDE generate it for me).
>
> Main key (for mapper & reducer):
>
> public int hashCode() {
>     int result = kVersion;
>     result = 31 * result + (aKey != null ? aKey.hashCode() : 0);
>     result = 31 * result + (gKey != null ? gKey.hashCode() : 0);
>     result = 31 * result + (int) (date ^ (date >>> 32));
>     result = 31 * result + (ma != null ? ma.hashCode() : 0);
>     result = 31 * result + (cl != null ? cl.hashCode() : 0);
>     return result;
> }
>
> aKey : AKey class
>
> public int hashCode() {
>     int result = kVersion;
>     result = 31 * result + (v != null ? v.hashCode() : 0);
>     result = 31 * result + (s != null ? s.hashCode() : 0);
>     result = 31 * result + (o != null ? o.hashCode() : 0);
>     result = 31 * result + (l != null ? l.hashCode() : 0);
>     result = 31 * result + (e ? 1 : 0);   // boolean
>     result = 31 * result + (li ? 1 : 0);  // boolean
>     result = 31 * result + (aut ? 1 : 0); // boolean
>     return result;
> }

Both of these look fine, assuming all the other hashCode()s return the same value every time.

> When this happens, I do see the same values for the key. Also, I am not
> using a grouping comparator.

So you see two reduce methods getting the same key with the same values? That's extremely odd. If this is the case, there's a bug in Hadoop. Can you find the relevant logs from the reducers where Hadoop fetches the map output? Does it look like it's fetching the same output twice? Do the two tasks where you see the duplicates have the same task ID? Can you confirm for us that the reduce tasks are from the same job ID?

> I was wondering, since the call to HashPartitioner.getPartition() is done
> from a map task, several of which are running on different machines, is it
> possible that they get a different hashcode and hence are assigned
> different reducers even when the key is the same?

The hashCode() result should *always* be the same given the same internal state. In other words, it should be consistent and stable. If I have a string new String("hello world"), it will always have the exact same hashCode(). If this isn't true, you will get wildly unpredictable results not just with Hadoop but with Java's comparators, collections, etc.

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
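To make the failure mode in this thread concrete, here is a sketch of the partition computation as done by Hadoop's HashPartitioner in the 0.20.x line (mask off the sign bit, then mod by the reducer count). If key.hashCode() is not stable across JVMs, as with Enum.hashCode(), two mappers in different JVMs can route the same logical key to different reducers. The class name and example values are mine, not from the thread.

```java
public class PartitionSketch {
    // Same arithmetic as Hadoop's HashPartitioner.getPartition():
    // clear the sign bit so the result is non-negative, then take
    // the remainder modulo the number of reduce tasks.
    static int getPartition(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 10;
        // A stable hash (e.g. derived from name().hashCode()) always
        // lands on the same reducer in every JVM:
        System.out.println(getPartition("ALPHA".hashCode(), reducers));
        // Two different identity hashes for the same enum constant in
        // two JVMs would generally land on different reducers:
        System.out.println(getPartition(0x1234567, reducers));
        System.out.println(getPartition(0x89abcde, reducers));
    }
}
```

This is why the thread's fix works: once every field folded into the key's hashCode() is deterministic, getPartition() is deterministic too, and all mappers agree on the reducer for a given key.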
