Hi Piyush,

I think you need to override the built-in partitioning function. You can use a function like (first field of key) % 3, which will route all keys sharing the same first field to the same reduce process, and keys with different first fields to separate ones.

Please correct me if I am wrong.

Thanks,
Utkarsh

From: Piyush Kansal [mailto:piyush.kan...@gmail.com]
Sent: Monday, February 20, 2012 7:39 AM
To: mapreduce-user@hadoop.apache.org
Subject: Query regarding Hadoop Partitioning
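Just to illustrate the idea in plain Python (this is only a sketch of the routing logic, not the actual Hadoop partitioner API, and the field layout is assumed from your input format):

```python
# Illustration only: mimics (first field of key) % numReduceTasks.
# Hadoop's real partitioners are Java classes; this is not Hadoop API.

def route(key, num_reducers=3):
    """Pick a reducer index from the first tab-separated field of the key."""
    first_field = key.split("\t")[0]
    return int(first_field) % num_reducers

# Keys sharing the same first field always land on the same reducer,
# while different first fields go to different reducers:
print(route("1\tchr1\t1"))   # same reducer as the next line
print(route("1\tchr1\t9"))
print(route("2\tchr1\t11"))  # a different reducer
```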
Hi Friends,

I have to sort a huge amount of data in the minimum possible time, probably using partitioning. The key is composed of 3 fields (partition, text, and number). The partition field is defined as:

* Partition "1" for range 1-10
* Partition "2" for range 11-20
* Partition "3" for range 21-30

Input file format: partition[tab]text[tab]range-start[tab]range-end

[cloudera@localhost kMer2]$ cat input1
1 chr1 1 10
1 chr1 2 8
2 chr1 11 18
[cloudera@localhost kMer2]$ cat input2
1 chr1 3 7
2 chr1 12 19
[cloudera@localhost kMer2]$ cat input3
3 chr1 22 30
[cloudera@localhost kMer2]$ cat input4
3 chr1 22 30
1 chr1 9 10
2 chr1 15 16

Then I ran the following command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -D stream.map.output.field.separator='\t' \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator='\t' \
    -D mapred.text.key.partitioner.options=-k1 \
    -D mapred.reduce.tasks=3 \
    -input /usr/pkansal/kMer2/ip \
    -output /usr/pkansal/kMer2/op \
    -mapper /home/cloudera/kMer2/kMer2Map.py \
    -file /home/cloudera/kMer2/kMer2Map.py \
    -reducer /home/cloudera/kMer2/kMer2Red.py \
    -file /home/cloudera/kMer2/kMer2Red.py

Both the mapper and reducer scripts just pass each line through unchanged:

import sys

for line in sys.stdin:
    line = line.strip()
    print "%s" % (line)

Following is the output:

[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
2 chr1 12 19
2 chr1 15 16
3 chr1 22 30
3 chr1 22 30
[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
1 chr1 2 8
1 chr1 3 7
1 chr1 9 10
2 chr1 11 18
[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
1 chr1 1 10
3 chr1 22 29

This is not the output I expected. I expected all records with:

* partition 1 in one single file, e.g. part-00000
* partition 2 in one single file, e.g. part-00001
* partition 3 in one single file, e.g. part-00002

Can you please suggest whether I am doing this the right way?
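One thing I am wondering about (untested on my side, so please correct me): do I also need to pass -partitioner explicitly so that mapred.text.key.partitioner.options actually takes effect? Something like:

```shell
# Untested sketch: same jar and paths as above, but explicitly
# selecting KeyFieldBasedPartitioner and restricting partitioning
# to the first key field only (-k1,1).
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -D stream.map.output.field.separator='\t' \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator='\t' \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -D mapred.reduce.tasks=3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /usr/pkansal/kMer2/ip \
    -output /usr/pkansal/kMer2/op \
    -mapper /home/cloudera/kMer2/kMer2Map.py \
    -file /home/cloudera/kMer2/kMer2Map.py \
    -reducer /home/cloudera/kMer2/kMer2Red.py \
    -file /home/cloudera/kMer2/kMer2Red.py
```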
--
Regards,
Piyush Kansal