Secondary sort is the way to go. It is easier to dedup a sorted input set, since duplicate values end up next to each other. You can also filter in the map and combine phases as far as it is safe to do so (using sets, etc.), to speed up the process and reduce data transfer.
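To illustrate the sorted-input dedup, here is a minimal sketch in plain Java rather than a full Hadoop job. The names SortedDedup and dedupSorted are hypothetical, and the sketch assumes the secondary sort has already delivered the values for one key in ascending order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Hadoop: collapses runs of equal values
// in an already-sorted stream, as a reducer could do after a secondary sort.
final class SortedDedup {

    // Assumes the iterator yields values in non-decreasing order.
    static List<Long> dedupSorted(Iterator<Long> sortedValues) {
        List<Long> unique = new ArrayList<>();
        boolean first = true;
        long previous = 0L;
        while (sortedValues.hasNext()) {
            long current = sortedValues.next();
            // Only the previous value has to be remembered, so no HashSet
            // of all values for the key is needed.
            if (first || current != previous) {
                unique.add(current);
                previous = current;
                first = false;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Values for key 1 from the example in this thread, after sorting.
        List<Long> values = Arrays.asList(2L, 3L, 3L, 4L);
        System.out.println(dedupSorted(values.iterator())); // prints [2, 3, 4]
    }
}
```

In a real reducer the same loop would run over the Iterable<LongWritable> and could write each unique value as soon as it is seen, instead of collecting a list first.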

On Wed, Aug 3, 2011 at 4:07 PM, Jianxin Wang <[email protected]> wrote:
> thanks! Matthew :
>
> how about using SecondarySort to get <key,values>, the values are
> sorted for every key.
> then traverse the sorted values to get all unique values.
>
> I am not sure which way is more efficient. I doubt HashSet is a
> complicated data structure.
>
> 2011/8/3 Matthew John <[email protected]>
>
>> Hey,
>>
>> I feel HashSet is a good method to dedup. To increase the overall
>> efficiency you could also look into Combiner running the same Reducer
>> code. That would ensure less data in the sort-shuffle phase.
>>
>> Regards,
>> Matthew
>>
>> On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <[email protected]> wrote:
>>
>> > hi,harsh
>> > After map, I can get all values for one key, but I want dedup these
>> > values, only get all unique values. now I just do it like the image.
>> >
>> > I think the following code is not efficient.(using a HashSet to dedup)
>> > Thanks:)
>> >
>> > private static class MyReducer extends
>> >     Reducer<LongWritable,LongWritable,LongWritable,LongsWritable>
>> > {
>> >     HashSet<Long> uids=new HashSet<Long>();
>> >     LongsWritable unique_uids=new LongsWritable();
>> >     public void reduce(LongWritable key,Iterable<LongWritable> values,Context context)
>> >         throws IOException,InterruptedException
>> >     {
>> >         uids.clear();
>> >         for(LongWritable v:values)
>> >         {
>> >             uids.add(v.get());
>> >         }
>> >         int size=uids.size();
>> >         long[] l=new long[size];
>> >         int i=0;
>> >         for(long uid:uids)
>> >         {
>> >             l[i]=uid;
>> >             i++;
>> >         }
>> >         unique_uids.Set(l);
>> >         context.write(key,unique_uids);
>> >     }
>> > }
>> >
>> >
>> > 2011/8/3 Harsh J <[email protected]>
>> >
>> >> Use MapReduce :)
>> >>
>> >> If map output: (key, value)
>> >> Then reduce input becomes: (key, [iterator of values across all maps
>> >> with (key, value)])
>> >>
>> >> I believe this is very similar to the wordcount example, but minus the
>> >> summing. For a given key, you get all the values that carry that key
>> >> in the reducer. Have you tried to run a simple program to achieve this
>> >> before asking? Or is something specifically not working?
>> >>
>> >> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <[email protected]> wrote:
>> >> > HI,
>> >> > I have many <key,value> pairs now, and want to get all different
>> >> > values for each key, which way is efficient for this work.
>> >> >
>> >> > such as input : <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
>> >> > output: <1,2/3/4> <2,1/2>
>> >> >
>> >> > Thanks!
>> >> >
>> >> > walter
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >>
>> >
>> >
>> >

--
Harsh J
