Hi Piyush,

I think you need to override the built-in partitioning function. You can use a function like (first field of key) % 3, which will route all keys sharing the same first field to the same reduce process, and keys with different first fields to separate ones.

Please correct me if I am wrong.

Thanks,
Utkarsh

From: Piyush Kansal [mailto:piyush.kan...@gmail.com]
Sent: Monday, February 20, 2012 7:39 AM
To: mapreduce-user@hadoop.apache.org
Subject: Query regarding Hadoop Partitioning
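Just to illustrate the idea in plain Python (this is only a sketch of the routing logic, not the actual Hadoop partitioner API, and the field layout is assumed from your input format):

```python
# Illustration only: mimics (first field of key) % numReduceTasks.
# Hadoop's real partitioners are Java classes; this is not Hadoop API.

def route(key, num_reducers=3):
    """Pick a reducer index from the first tab-separated field of the key."""
    first_field = key.split("\t")[0]
    return int(first_field) % num_reducers

# Keys sharing the same first field always land on the same reducer,
# while different first fields go to different reducers:
print(route("1\tchr1\t1"))   # same reducer as the next line
print(route("1\tchr1\t9"))
print(route("2\tchr1\t11"))  # a different reducer
```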
Hi Friends,

I have to sort a huge amount of data in the minimum possible time, probably using partitioning. The key is composed of 3 fields (partition, text, and number). The partition field is defined as:

* Partition "1" for range 1-10
* Partition "2" for range 11-20
* Partition "3" for range 21-30

Input file format: partition[tab]text[tab]range-start[tab]range-end

[cloudera@localhost kMer2]$ cat input1
1 chr1 1 10
1 chr1 2 8
2 chr1 11 18
[cloudera@localhost kMer2]$ cat input2
1 chr1 3 7
2 chr1 12 19
[cloudera@localhost kMer2]$ cat input3
3 chr1 22 30
[cloudera@localhost kMer2]$ cat input4
3 chr1 22 30
1 chr1 9 10
2 chr1 15 16

Then I ran the following command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -D stream.map.output.field.separator='\t' \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator='\t' \
    -D mapred.text.key.partitioner.options=-k1 \
    -D mapred.reduce.tasks=3 \
    -input /usr/pkansal/kMer2/ip \
    -output /usr/pkansal/kMer2/op \
    -mapper /home/cloudera/kMer2/kMer2Map.py \
    -file /home/cloudera/kMer2/kMer2Map.py \
    -reducer /home/cloudera/kMer2/kMer2Red.py \
    -file /home/cloudera/kMer2/kMer2Red.py

Both the mapper and reducer scripts just pass each line through unchanged:

import sys

for line in sys.stdin:
    line = line.strip()
    print "%s" % (line)

Following is the output:

[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
2 chr1 12 19
2 chr1 15 16
3 chr1 22 30
3 chr1 22 30
[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
1 chr1 2 8
1 chr1 3 7
1 chr1 9 10
2 chr1 11 18
[cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
1 chr1 1 10
3 chr1 22 29

This is not the output I expected. I expected all records with:

* partition 1 in one single file, e.g. part-00000
* partition 2 in one single file, e.g. part-00001
* partition 3 in one single file, e.g. part-00002

Can you please suggest whether I am doing this the right way?
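One thing I am wondering about (untested on my side, so please correct me): do I also need to pass -partitioner explicitly so that mapred.text.key.partitioner.options actually takes effect? Something like:

```shell
# Untested sketch: same jar and paths as above, but explicitly
# selecting KeyFieldBasedPartitioner and restricting partitioning
# to the first key field only (-k1,1).
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -D stream.map.output.field.separator='\t' \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator='\t' \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -D mapred.reduce.tasks=3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /usr/pkansal/kMer2/ip \
    -output /usr/pkansal/kMer2/op \
    -mapper /home/cloudera/kMer2/kMer2Map.py \
    -file /home/cloudera/kMer2/kMer2Map.py \
    -reducer /home/cloudera/kMer2/kMer2Red.py \
    -file /home/cloudera/kMer2/kMer2Red.py
```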
--
Regards,
Piyush Kansal