Thanks, Matthew!

How about using a secondary sort to get <key,values>, so that the values for every key arrive already sorted? The reducer could then traverse the sorted values to get all unique values.

I am not sure which way is more efficient. I suspect HashSet is a fairly heavy data structure.

2011/8/3 Matthew John <[email protected]>
> Hey,
>
> I feel HashSet is a good method to dedup. To increase the overall efficiency
> you could also look into Combiner running the same Reducer code. That would
> ensure less data in the sort-shuffle phase.
>
> Regards,
> Matthew
>
> On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <[email protected]> wrote:
>
> > hi, harsh
> > After map, I can get all values for one key, but I want to dedup these
> > values and get only the unique values. Now I just do it like the image.
> >
> > I think the following code is not efficient (using a HashSet to dedup).
> > Thanks :)
> >
> > private static class MyReducer extends
> >     Reducer<LongWritable,LongWritable,LongWritable,LongsWritable>
> > {
> >     HashSet<Long> uids=new HashSet<Long>();
> >     LongsWritable unique_uids=new LongsWritable();
> >     public void reduce(LongWritable key,Iterable<LongWritable> values,Context context)
> >             throws IOException,InterruptedException
> >     {
> >         uids.clear();
> >         for(LongWritable v:values)
> >         {
> >             uids.add(v.get());
> >         }
> >         int size=uids.size();
> >         long[] l=new long[size];
> >         int i=0;
> >         for(long uid:uids)
> >         {
> >             l[i]=uid;
> >             i++;
> >         }
> >         unique_uids.Set(l);
> >         context.write(key,unique_uids);
> >     }
> > }
> >
> > 2011/8/3 Harsh J <[email protected]>
> >
> >> Use MapReduce :)
> >>
> >> If map output: (key, value)
> >> Then reduce input becomes: (key, [iterator of values across all maps
> >> with (key, value)])
> >>
> >> I believe this is very similar to the wordcount example, but minus the
> >> summing. For a given key, you get all the values that carry that key
> >> in the reducer. Have you tried to run a simple program to achieve this
> >> before asking? Or is something specifically not working?
> >>
> >> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <[email protected]> wrote:
> >> > Hi,
> >> > I have many <key,value> pairs now, and want to get all different values
> >> > for each key. Which way is efficient for this work?
> >> >
> >> > such as input : <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
> >> > output: <1,2/3/4> <2,1/2>
> >> >
> >> > Thanks!
> >> >
> >> > walter
> >>
> >> --
> >> Harsh J
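
To make the sorted-values idea from the top of this message concrete, here is a minimal sketch in plain Java (no Hadoop classes). It assumes the values for one key already arrive in ascending order, which the secondary sort would have to guarantee. The names SortedDedup and uniqueSorted are hypothetical, made up only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: collapses runs of equal values
// in an already-sorted sequence, as a reducer could do after a secondary sort.
final class SortedDedup {

    // Returns the distinct values of `sorted`, which is assumed to be ascending.
    static long[] uniqueSorted(Iterable<Long> sorted) {
        List<Long> out = new ArrayList<>();
        boolean first = true;
        long previous = 0L;
        for (long v : sorted) {
            // In sorted input duplicates are always adjacent, so comparing
            // with the previously kept value is enough; no HashSet is needed.
            if (first || v != previous) {
                out.add(v);
                previous = v;
                first = false;
            }
        }
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Values for key 1 from the example input, once sorted: 2, 3, 3, 4
        long[] unique = uniqueSorted(Arrays.asList(2L, 3L, 3L, 4L));
        System.out.println(Arrays.toString(unique)); // prints [2, 3, 4]
    }
}
```

The loop needs only one remembered value per key, so the trade-off against the HashSet version is the extra sorting work in the shuffle versus hashing in the reducer. Which one wins would have to be measured on real data.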
