Thanks, Matthew!

How about using a secondary sort, so that the reducer receives <key, values>
with the values already sorted for each key, and then traversing the sorted
values to collect the unique ones?

I am not sure which way is more efficient. I suspect HashSet is a fairly
heavyweight data structure.

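A minimal sketch of the sorted-values idea, in plain Java without the Hadoop
types so it can be run on its own. It assumes the values for one key already
arrive in sorted order (as a secondary sort would deliver them); the class
name SortedDedup and the helper uniqueFromSorted are hypothetical, not part
of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedDedup {
    // Hypothetical helper: returns the distinct values of an already sorted
    // sequence. Equal values are adjacent after sorting, so comparing each
    // value with the last one kept is enough and no HashSet is needed.
    static long[] uniqueFromSorted(Iterable<Long> sortedValues) {
        List<Long> kept = new ArrayList<>();
        boolean first = true;
        long previous = 0L;
        for (long v : sortedValues) {
            if (first || v != previous) {
                kept.add(v);
                previous = v;
                first = false;
            }
        }
        // Copy into a primitive array, mirroring the long[] used in the reducer.
        long[] out = new long[kept.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = kept.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Values for key 1 from the example in this thread, after sorting.
        long[] unique = uniqueFromSorted(Arrays.asList(2L, 3L, 3L, 4L));
        System.out.println(Arrays.toString(unique)); // prints [2, 3, 4]
    }
}
```

Whether this beats the HashSet version depends on the data. In a real job the
secondary sort also needs a composite key, a custom partitioner and a grouping
comparator, so the extra setup has to be weighed against the saved hashing.
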
2011/8/3 Matthew John <[email protected]>

> Hey,
>
> I think a HashSet is a good way to dedup. To increase the overall
> efficiency, you could also look into a Combiner running the same Reducer
> code. That would ensure less data in the sort-shuffle phase.
>
> Regards,
> Matthew
>
> On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <[email protected]> wrote:
>
> > Hi Harsh,
> >     After the map phase I can get all values for one key, but I want to
> > dedup these values and keep only the unique ones. For now I do it as
> > shown below.
> >
> >     I think the following code is not efficient (it uses a HashSet to
> > dedup). Thanks :)
> >
> > private static class MyReducer extends
> >         Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
> >     // Reused across reduce() calls to avoid reallocating for every key.
> >     HashSet<Long> uids = new HashSet<Long>();
> >     LongsWritable unique_uids = new LongsWritable();
> >
> >     public void reduce(LongWritable key, Iterable<LongWritable> values,
> >             Context context) throws IOException, InterruptedException {
> >         // Collect the distinct values for this key.
> >         uids.clear();
> >         for (LongWritable v : values) {
> >             uids.add(v.get());
> >         }
> >         // Copy the set into a long[] for the output writable.
> >         int size = uids.size();
> >         long[] l = new long[size];
> >         int i = 0;
> >         for (long uid : uids) {
> >             l[i] = uid;
> >             i++;
> >         }
> >         unique_uids.Set(l);
> >         context.write(key, unique_uids);
> >     }
> > }
> >
> >
> > 2011/8/3 Harsh J <[email protected]>
> >
> >> Use MapReduce :)
> >>
> >> If map output: (key, value)
> >> Then reduce input becomes: (key, [iterator of values across all maps
> >> with (key, value)])
> >>
> >> I believe this is very similar to the wordcount example, but minus the
> >> summing. For a given key, you get all the values that carry that key
> >> in the reducer. Have you tried to run a simple program to achieve this
> >> before asking? Or is something specifically not working?
> >>
> >> On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <[email protected]> wrote:
> >> > Hi,
> >> >    I have many <key,value> pairs and want to get all the distinct
> >> > values for each key. Which way is efficient for this?
> >> >
> >> >   Example input:   <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
> >> >   Expected output: <1,2/3/4> <2,1/2>
> >> >
> >> >   Thanks!
> >> >
> >> > walter
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
> >
>
