OK, I think I stumbled upon the correct incantation:

time hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.reduce.tasks=16 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

This will partition and sort the files as I expect, leaving me with 16
output files, 14 of which are empty and 2 non-empty. If I increase the
number of partitions in the data so they exceed the number of reduce
tasks, multiple partitions will be written to some or all of the output
files. I believe I can deal with that now that I understand it, but it
would be nice if the number of output files was equal to the number of
partitions in the data.

-K

On Thu, Feb 10, 2011 at 11:45 AM, Kelly Burkhart <[email protected]> wrote:
> Hi,
>
> I'm trying to get partitioning working from a streaming map/reduce
> job. I'm using hadoop r0.20.2.
>
> Consider the following files, both in the same hdfs directory:
>
> f1:
> 01:01:01<TAB>a,a,a,a,a,1
> 01:01:02<TAB>a,a,a,a,a,2
> 01:02:01<TAB>a,a,a,a,a,3
> 01:02:02<TAB>a,a,a,a,a,4
> 02:01:01<TAB>a,a,a,a,a,5
> 02:01:02<TAB>a,a,a,a,a,6
> 02:02:01<TAB>a,a,a,a,a,7
> 02:02:02<TAB>a,a,a,a,a,8
>
> f2:
> 01:01:01<TAB>b,b,b,b,b,1
> 01:01:02<TAB>b,b,b,b,b,2
> 01:02:01<TAB>b,b,b,b,b,3
> 01:02:02<TAB>b,b,b,b,b,4
> 02:01:01<TAB>b,b,b,b,b,5
> 02:01:02<TAB>b,b,b,b,b,6
> 02:02:01<TAB>b,b,b,b,b,7
> 02:02:02<TAB>b,b,b,b,b,8
>
> I execute the following command:
>
> hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
>   -D stream.map.output.field.separator=: \
>   -D stream.num.map.output.key.fields=3 \
>   -D map.output.key.field.separator=: \
>   -D mapred.text.key.partitioner.options=-k1,1 \
>   -input /tmp/krb/part \
>   -output /tmp/krb/mp \
>   -mapper /bin/cat \
>   -reducer /bin/cat \
>   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
>
> (actually I've executed about a zillion permutations of various -D
> arguments...)
>
> I end up with a single file sorted by the entire key, exactly what I
> expect if no partitioning at all is going on. What I'm hoping to end
> up with is two output files, each file has the first component of the
> key in common:
>
> 01:01:01<TAB>a,a,a,a,a,1
> 01:01:01<TAB>b,b,b,b,b,1
> 01:01:02<TAB>a,a,a,a,a,2
> 01:01:02<TAB>b,b,b,b,b,2
> 01:02:01<TAB>a,a,a,a,a,3
> 01:02:01<TAB>b,b,b,b,b,3
> 01:02:02<TAB>a,a,a,a,a,4
> 01:02:02<TAB>b,b,b,b,b,4
>
> Can anyone suggest a command that may partition files as I describe?
>
> Also, it seems that the API has changed considerably from my version
> 0.20.x to the latest version r0.21. Is 0.20 expected to work? Or are
> there some fatal issues that forced major work resulting in release
> 0.21.
>
> Thanks,
>
> -Kelly
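
For anyone who finds this thread later: the 16-files-but-only-2-non-empty result can be reproduced locally with a rough sketch like the one below. It is not Hadoop code. partition_of is a made-up helper, and it rests on my assumption that KeyFieldBasedPartitioner hashes the bytes of the selected key field as h = 31*h + byte and then takes the non-negative result modulo the number of reduce tasks. It also assumes plain ASCII keys and ':' as the key field separator, as in the command at the top of this mail.

```shell
#!/bin/sh
# Local simulation of "-k1,1" partitioning. Hypothetical helper, not part
# of Hadoop; the hash below is an assumption about the partitioner.
#   $1 = full key (e.g. 01:01:01), $2 = number of reduce tasks
partition_of() {
  printf '%s\n' "$1" | awk -F: -v n="$2" '
    BEGIN {
      # Lookup table from ASCII character to its byte value.
      for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i
    }
    {
      h = 0
      # Hash only the first ":"-separated field of the key.
      for (i = 1; i <= length($1); i++) {
        # Reducing modulo 2^31 at every step keeps the value exact in awk
        # and matches masking off the sign bit of a 32-bit int at the end.
        h = (31 * h + ord[substr($1, i, 1)]) % 2147483648
      }
      print h % n
    }'
}

# Show which reduce task each distinct first field would map to.
for key in 01:01:01 01:02:02 02:01:01 02:02:02; do
  printf '%s -> reduce task %s of 16\n' "$key" "$(partition_of "$key" 16)"
done
```

If the assumption holds, every key starting with 01 maps to one reduce task and every key starting with 02 to another, so the other 14 reducers receive nothing and write empty files. Two different first fields whose hashes agree modulo the reducer count would share an output file, which is the "multiple partitions in one file" case.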
