Re: how to implements the 'diff' cmd in hadoop

botma lin Tue, 20 Mar 2012 04:22:56 -0700

Thanks a lot!


On Tue, Mar 20, 2012 at 7:13, Bejoy Ks <[email protected]> wrote:

> Yes, if you are comparing more than two files, the mapper needs to emit the
> file name or ID. If there are just two files and you only want to know which
> lines differ, the line number alone is enough. But if you want more granular
> information, such as exactly which files contain which changes, prefix the
> mapper's output value with an identifier such as the file name.
>
> Regards
> Bejoy KS
>
> 2012/3/20 botma lin <[email protected]>
>
> > Thanks, Bejoy, that makes sense.
> >
> > If I want to know which file a differing record came from, I need to
> > put an extra file ID into the mapper's output value, then read it in the
> > reducer.
> >
> > Do you have any other ideas?
> >
> > Thanks!
> >
> >
> > On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <[email protected]> wrote:
> >
> > > Hi Lin
> > >
> > > In your mapper, make the line number the key and the line contents the
> > > value. In your reducer, check whether the two values for a key match;
> > > i.e., if you are comparing two files, there will be two values for each
> > > line number. If they do not match, increment a counter to track the
> > > number of mismatches and write the mismatched lines to the output file.
> > > If the values for a key match, do nothing; there is no need to write
> > > anything to the output directory.
> > >
> > > Regards
> > > Bejoy KS
> > >
> > > On Tue, Mar 20, 2012 at 2:01 PM, botma lin <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm a newbie to Hadoop.
> > > >
> > > > I'm trying to compare two large files and get the differences between
> > > > them, like the diff command in Linux. However, the mapred API can only
> > > > get one record at a time, so how can I get the corresponding records in
> > > > the two files and compare them using the mapred API?
> > > >
> > > > Thanks!
> > > >
> > >
> >
>
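
For anyone reading this thread later, here is a minimal sketch of the approach described above: the mapper keys each line by its line number and tags the value with a file ID, and the reducer reports keys whose values do not match. It is a plain-Python simulation of the map, shuffle, and reduce steps, not Hadoop API code; the function names (map_file, shuffle, reduce_line, diff) and the file IDs "a" and "b" are made up for illustration. It also assumes each record already knows its line number, whereas Hadoop's default TextInputFormat supplies byte offsets as keys, so a real job would need to derive line numbers some other way.

```python
from collections import defaultdict


def map_file(file_id, lines):
    """Mapper: emit (line_no, (file_id, line)) for every line of one file."""
    for line_no, line in enumerate(lines, start=1):
        yield line_no, (file_id, line)


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_line(line_no, values):
    """Reducer: return (line_no, tagged values) if the line differs, else None."""
    contents = {line for _, line in values}
    # Two values with identical contents means the line matches in both files.
    # A line present in only one file is also reported as a difference.
    if len(values) == 2 and len(contents) == 1:
        return None
    return line_no, sorted(values)


def diff(file_a, file_b):
    """Run the simulated job; return the mismatches and the mismatch counter."""
    pairs = list(map_file("a", file_a)) + list(map_file("b", file_b))
    groups = shuffle(pairs)
    mismatches = []
    for line_no in sorted(groups):
        result = reduce_line(line_no, groups[line_no])
        if result is not None:
            mismatches.append(result)
    return mismatches, len(mismatches)
```

Note that this compares lines strictly by position, so a single inserted line makes every later line show up as a mismatch, unlike the Linux diff command.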
