Hi, I'm trying to get partitioning working in a streaming map/reduce job. I'm using Hadoop r0.20.2.
Consider the following files, both in the same HDFS directory:

f1:
01:01:01<TAB>a,a,a,a,a,1
01:01:02<TAB>a,a,a,a,a,2
01:02:01<TAB>a,a,a,a,a,3
01:02:02<TAB>a,a,a,a,a,4
02:01:01<TAB>a,a,a,a,a,5
02:01:02<TAB>a,a,a,a,a,6
02:02:01<TAB>a,a,a,a,a,7
02:02:02<TAB>a,a,a,a,a,8

f2:
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>b,b,b,b,b,4
02:01:01<TAB>b,b,b,b,b,5
02:01:02<TAB>b,b,b,b,b,6
02:02:01<TAB>b,b,b,b,b,7
02:02:02<TAB>b,b,b,b,b,8

I execute the following command:

hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D stream.map.output.field.separator=: \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator=: \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -input /tmp/krb/part \
    -output /tmp/krb/mp \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

(Actually, I've executed about a zillion permutations of various -D arguments...)

I end up with a single output file sorted by the entire key, which is exactly what I'd expect if no partitioning were happening at all. What I'm hoping to end up with is two output files, where each file's records share the first component of the key:

01:01:01<TAB>a,a,a,a,a,1
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>a,a,a,a,a,2
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>a,a,a,a,a,3
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>a,a,a,a,a,4
01:02:02<TAB>b,b,b,b,b,4

Can anyone suggest a command that partitions the files as described?

Also, it seems that the API has changed considerably between my version (0.20.x) and the latest release (r0.21). Is 0.20 expected to work, or were there fatal issues that forced major rework, resulting in release 0.21?

Thanks,
-Kelly
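For context, here is my mental model of what a key-field partitioner does, sketched in Python. This is an illustrative toy, not Hadoop's actual implementation: it hashes only the leading key fields (selected by the field separator) and takes the result modulo the number of reduce tasks, so the number of output files is bounded by the reducer count, not by how many distinct key prefixes exist.

```python
def partition(record, num_reducers, sep=":", num_fields=1):
    """Toy model of key-field partitioning (names are illustrative).

    The key is everything before the first tab; only the first
    `num_fields` separator-delimited fields of it are hashed.
    """
    key = record.split("\t", 1)[0]
    prefix = sep.join(key.split(sep)[:num_fields])
    return hash(prefix) % num_reducers

records = [
    "01:01:01\ta,a,a,a,a,1",
    "01:02:02\ta,a,a,a,a,4",
    "02:01:01\tb,b,b,b,b,5",
]

# With a single reducer, every record lands in partition 0,
# i.e. one output file regardless of partitioner settings.
assert all(partition(r, 1) == 0 for r in records)

# With two reducers, records sharing the first key field ("01")
# are guaranteed to land in the same partition.
assert partition(records[0], 2) == partition(records[1], 2)
```

If this model is right, then two output files can only appear when the job runs with at least two reduce tasks; with the default of one reducer the partitioner's choice is invisible.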
