Hey, a HashSet is a fine way to dedup. To improve overall efficiency you could also look into a Combiner running the same dedup logic; that would mean less data moving through the sort/shuffle phase.
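A minimal sketch of such a combiner, assuming your map output types are (LongWritable, LongWritable). Note it can't be MyReducer itself, because a combiner must emit the same types the map emits, and MyReducer emits LongsWritable; the class name DedupCombiner is just my placeholder:

import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupCombiner
        extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

    private final LongWritable out = new LongWritable();

    @Override
    public void reduce(LongWritable key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        // Drop duplicate values locally; the reducer still dedups globally,
        // since a combiner may run zero, one, or several times per key.
        HashSet<Long> seen = new HashSet<Long>();
        for (LongWritable v : values) {
            if (seen.add(v.get())) {
                out.set(v.get());
                context.write(key, out);
            }
        }
    }
}

You'd register it in the driver with job.setCombinerClass(DedupCombiner.class).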
Regards,
Matthew

On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <[email protected]> wrote:
> hi, harsh
> After the map I can get all the values for one key, but I want to dedup
> these values and keep only the unique ones. Right now I do it as below.
>
> I think the following code is not efficient (it uses a HashSet to dedup).
> Thanks :)
>
> private static class MyReducer extends
>         Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
>     HashSet<Long> uids = new HashSet<Long>();
>     LongsWritable unique_uids = new LongsWritable();
>
>     @Override
>     public void reduce(LongWritable key, Iterable<LongWritable> values,
>             Context context) throws IOException, InterruptedException {
>         // Collect the distinct values for this key.
>         uids.clear();
>         for (LongWritable v : values) {
>             uids.add(v.get());
>         }
>         // Copy the distinct values into an array and emit them.
>         long[] l = new long[uids.size()];
>         int i = 0;
>         for (long uid : uids) {
>             l[i++] = uid;
>         }
>         unique_uids.set(l);
>         context.write(key, unique_uids);
>     }
> }
>
> 2011/8/3 Harsh J <[email protected]>
>
>> Use MapReduce :)
>>
>> If map output: (key, value)
>> Then reduce input becomes: (key, [iterator of values across all maps
>> with (key, value)])
>>
>> I believe this is very similar to the wordcount example, minus the
>> summing. For a given key, you get all the values that carry that key
>> in the reducer. Have you tried to run a simple program to achieve this
>> before asking? Or is something specifically not working?
>>
>> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <[email protected]> wrote:
>> > Hi,
>> > I have many <key,value> pairs now, and I want to get all the distinct
>> > values for each key. Which way is efficient for this work?
>> >
>> > such as input : <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
>> > output: <1,2/3/4> <2,1/2>
>> >
>> > Thanks!
>> >
>> > walter
>>
>> --
>> Harsh J
