Re: how to implements the 'diff' cmd in hadoop

botma lin Tue, 20 Mar 2012 04:06:40 -0700

Thanks, Bejoy, that makes sense.

       If I want to know which file each differing record came from, I need
to add a file ID to the mapper's output value and then read it back in the
reducer.
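
A minimal in-memory sketch of this idea, not actual Hadoop code. It assumes
the mapper can see each line's number (Hadoop's default TextInputFormat hands
the mapper a byte offset as the key, not a line number, so that part is an
assumption). The names map_lines, shuffle, reduce_diff and diff, and the
file IDs "A" and "B", are made up for illustration.

```python
from collections import defaultdict


def map_lines(file_id, lines):
    """Mapper: emit (line_no, (file_id, content)) for each line of one file."""
    for line_no, content in enumerate(lines, start=1):
        yield line_no, (file_id, content)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_diff(tagged_values):
    """Reducer: return the tagged values if the lines differ, else None."""
    contents = {content for _, content in tagged_values}
    # Two values with identical content means the line matches in both files.
    if len(tagged_values) == 2 and len(contents) == 1:
        return None
    # Otherwise the contents differ, or the line exists in only one file.
    return sorted(tagged_values)


def diff(file_a, file_b):
    """Run map, shuffle and reduce over two lists of lines."""
    pairs = list(map_lines("A", file_a)) + list(map_lines("B", file_b))
    mismatches = {}
    for line_no, values in sorted(shuffle(pairs).items()):
        result = reduce_diff(values)
        if result is not None:
            mismatches[line_no] = result
    # The length of the result plays the role of the mismatch counter.
    return mismatches, len(mismatches)
```

Calling diff(["x", "y", "z"], ["x", "Y"]) reports lines 2 and 3: line 2
carries one tagged value per file, and line 3 carries only the value from
file "A", which shows where the extra line came from.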


      Do you have any other ideas?

Thanks!


On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <[email protected]> wrote:

> Hi Lin,
>        In your mapper, make the line number the key and the line contents
> the value. In your reducer, check whether the two values for a key match;
> i.e. if you are comparing two files, there will be two values for each
> line number. If they do not match, increment a counter to track the number
> of mismatches and write the mismatching lines to the output file. If the
> values match for a key, do nothing; there is no need even to write to the
> output dir.
>
> Regards
> Bejoy KS
>
> On Tue, Mar 20, 2012 at 2:01 PM, botma lin <[email protected]> wrote:
>
> > Hi, all
> >
> >      I'm a newbie to Hadoop.
> >
> >      I'm trying to compare two large files and get the differences between
> > them, like the diff command in Linux. However, the mapred API only gives
> > me one record at a time, so how can I get the corresponding records in the
> > two files and compare them using the mapred API?
> >
> >     Thanks!
> >
>
