Hi, I'm trying to get partitioning working in a streaming map/reduce job. I'm using Hadoop r0.20.2.
Consider the following files, both in the same HDFS directory:

f1:
01:01:01<TAB>a,a,a,a,a,1
01:01:02<TAB>a,a,a,a,a,2
01:02:01<TAB>a,a,a,a,a,3
01:02:02<TAB>a,a,a,a,a,4
02:01:01<TAB>a,a,a,a,a,5
02:01:02<TAB>a,a,a,a,a,6
02:02:01<TAB>a,a,a,a,a,7
02:02:02<TAB>a,a,a,a,a,8

f2:
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>b,b,b,b,b,4
02:01:01<TAB>b,b,b,b,b,5
02:01:02<TAB>b,b,b,b,b,6
02:02:01<TAB>b,b,b,b,b,7
02:02:02<TAB>b,b,b,b,b,8

I execute the following command:

hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D stream.map.output.field.separator=: \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator=: \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -input /tmp/krb/part \
    -output /tmp/krb/mp \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

(Actually, I've executed about a zillion permutations of various -D arguments...)

I end up with a single output file sorted by the entire key, which is exactly what I'd expect if no partitioning were happening at all. What I'm hoping to end up with is two output files, where each file's records share the first component of the key:

01:01:01<TAB>a,a,a,a,a,1
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>a,a,a,a,a,2
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>a,a,a,a,a,3
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>a,a,a,a,a,4
01:02:02<TAB>b,b,b,b,b,4

Can anyone suggest a command that partitions the files as described?

Also, it seems that the API has changed considerably between my version (0.20.x) and the latest release (r0.21). Is 0.20 expected to work, or were there fatal issues that forced major rework, resulting in release 0.21?

Thanks,
-Kelly
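For context, here is my mental model of what a key-field partitioner does, sketched in Python. This is an illustrative toy, not Hadoop's actual implementation: it hashes only the leading key fields (selected by the field separator) and takes the result modulo the number of reduce tasks, so the number of output files is bounded by the reducer count, not by how many distinct key prefixes exist.

```python
def partition(record, num_reducers, sep=":", num_fields=1):
    """Toy model of key-field partitioning (names are illustrative).

    The key is everything before the first tab; only the first
    `num_fields` separator-delimited fields of it are hashed.
    """
    key = record.split("\t", 1)[0]
    prefix = sep.join(key.split(sep)[:num_fields])
    return hash(prefix) % num_reducers

records = [
    "01:01:01\ta,a,a,a,a,1",
    "01:02:02\ta,a,a,a,a,4",
    "02:01:01\tb,b,b,b,b,5",
]

# With a single reducer, every record lands in partition 0,
# i.e. one output file regardless of partitioner settings.
assert all(partition(r, 1) == 0 for r in records)

# With two reducers, records sharing the first key field ("01")
# are guaranteed to land in the same partition.
assert partition(records[0], 2) == partition(records[1], 2)
```

If this model is right, then two output files can only appear when the job runs with at least two reduce tasks; with the default of one reducer the partitioner's choice is invisible.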
