I don't think streaming can write each partition to its own output file out of the box, but last.fm's Dumbo (highly recommended if you write Python M/R) together with its add-on Feathers library apparently can.
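What streaming does let you control is which reducer a key is sent to. As far as I know, mapred.text.key.partitioner.options is only read by KeyFieldBasedPartitioner, so it has no effect unless you also pass -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, which the command quoted below does not. Here is a rough Python sketch of the difference between partitioning on the whole three-field key and partitioning on the first field only. It is an illustration, not Hadoop's code: zlib.crc32 is a stand-in for the real hash, and all function names are made up for this example.

```python
import zlib

NUM_REDUCERS = 3  # matches mapred.reduce.tasks=3 in the quoted command

# A few sample keys from the thread: (partition, text, range-start)
KEYS = [
    ("1", "chr1", "1"),
    ("1", "chr1", "2"),
    ("1", "chr1", "3"),
    ("2", "chr1", "11"),
    ("2", "chr1", "12"),
    ("3", "chr1", "22"),
]


def stand_in_hash(text):
    """Deterministic stand-in hash; NOT the hash function Hadoop uses."""
    return zlib.crc32(text.encode("utf-8")) & 0x7FFFFFFF


def reducer_for_whole_key(key):
    """Choose a reducer from all three key fields joined by tabs."""
    return stand_in_hash("\t".join(key)) % NUM_REDUCERS


def reducer_for_first_field(key):
    """Choose a reducer from the first key field only (the idea behind -k1)."""
    return stand_in_hash(key[0]) % NUM_REDUCERS


def reducers_per_partition(choose):
    """Map each partition id to the set of reducers its keys were sent to."""
    seen = {}
    for key in KEYS:
        seen.setdefault(key[0], set()).add(choose(key))
    return seen


if __name__ == "__main__":
    print(reducers_per_partition(reducer_for_whole_key))
    print(reducers_per_partition(reducer_for_first_field))
```

Partitioning on the first field keeps every record of a partition in a single reducer's output. Two partition ids can still hash to the same reducer, though, so this alone does not guarantee one file per partition.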
See Erik Forsberg's detailed answer (second) on
http://stackoverflow.com/questions/1626786/generating-separate-output-files-in-hadoop-streaming
for more.

On Mon, Feb 20, 2012 at 1:57 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> Thanks for the immediate reply Harsh. I will try using it.
>
> By the way, cant we achieve the same goal with Hadoop Streaming (using
> Python)?
>
> On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Piyush,
>>
>> Yes. Currently the partitioned data is always sorted by (and then
>> grouped by) keys before the reduce() calls begin.
>>
>> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
>> > Thanks Harsh.
>> >
>> > But will it also sort the data as Partitioner does.
>> >
>> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> You would find it easier to use the Java API's MultipleOutputs (and/or
>> >> MultipleOutputFormat, which directly works on a configured key field),
>> >> to write each key-partition out in its own file.
>> >>
>> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
>> >> > Hi Friends,
>> >> >
>> >> > I have to sort huge amount of data in minimum possible time probably
>> >> > using partitioning. The key is composed of 3 fields(partition, text and
>> >> > number).
>> >> > This is how partition is defined:
>> >> >
>> >> > Partition "1" for range 1-10
>> >> > Partition "2" for range 11-20
>> >> > Partition "3" for range 21-30
>> >> >
>> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
>> >> >
>> >> > [cloudera@localhost kMer2]$ cat input1
>> >> >
>> >> > 1 chr1 1 10
>> >> > 1 chr1 2 8
>> >> > 2 chr1 11 18
>> >> >
>> >> > [cloudera@localhost kMer2]$ cat input2
>> >> >
>> >> > 1 chr1 3 7
>> >> > 2 chr1 12 19
>> >> >
>> >> > [cloudera@localhost kMer2]$ cat input3
>> >> >
>> >> > 3 chr1 22 30
>> >> >
>> >> > [cloudera@localhost kMer2]$ cat input4
>> >> >
>> >> > 3 chr1 22 30
>> >> > 1 chr1 9 10
>> >> > 2 chr1 15 16
>> >> >
>> >> > Then I ran following command:
>> >> >
>> >> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
>> >> > -D stream.map.output.field.separator='\t' \
>> >> > -D stream.num.map.output.key.fields=3 \
>> >> > -D map.output.key.field.separator='\t' \
>> >> > -D mapred.text.key.partitioner.options=-k1 \
>> >> > -D mapred.reduce.tasks=3 \
>> >> > -input /usr/pkansal/kMer2/ip \
>> >> > -output /usr/pkansal/kMer2/op \
>> >> > -mapper /home/cloudera/kMer2/kMer2Map.py \
>> >> > -file /home/cloudera/kMer2/kMer2Map.py \
>> >> > -reducer /home/cloudera/kMer2/kMer2Red.py \
>> >> > -file /home/cloudera/kMer2/kMer2Red.py
>> >> >
>> >> > Both mapper and reducer scripts just contain one line of code:
>> >> >
>> >> > for line in sys.stdin:
>> >> >     line = line.strip()
>> >> >     print "%s" % (line)
>> >> >
>> >> > Following is the o/p:
>> >> >
>> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
>> >> >
>> >> > 2 chr1 12 19
>> >> > 2 chr1 15 16
>> >> > 3 chr1 22 30
>> >> > 3 chr1 22 30
>> >> >
>> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
>> >> >
>> >> > 1 chr1 2 8
>> >> > 1 chr1 3 7
>> >> > 1 chr1 9 10
>> >> > 2 chr1 11 18
>> >> >
>> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
>> >> >
>> >> > 1 chr1 1 10
>> >> > 3 chr1 22 29
>> >> >
>> >> > This is not the o/p which I expected. I expected all records with:
>> >> >
>> >> > partition 1 in one single file eg part-m-00000
>> >> > partition 2 in one single file eg part-m-00001
>> >> > partition 3 in one single file eg part-m-00002
>> >> >
>> >> > Can you please suggest if I am doing it in a right way?
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Piyush Kansal
>> >> >
>> >>
>> >> --
>> >> Harsh J
>> >> Customer Ops. Engineer
>> >> Cloudera | http://tiny.cloudera.com/about
>> >
>> > --
>> > Regards,
>> > Piyush Kansal
>> >
>>
>> --
>> Harsh J
>> Customer Ops. Engineer
>> Cloudera | http://tiny.cloudera.com/about
>
> --
> Regards,
> Piyush Kansal
>

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about