Okay, fixed a problem with corrupted file transfers from S3... now wc -l
produces the same 143710 records, so it wasn't a Pig problem, and I'm now
getting the correct result from both methods. Not sure what went wrong...
thanks for the help, guys.
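
For anyone who hits the same thing: one way to catch corrupted transfers
like this is to compare local checksums against what S3 reports. A minimal
sketch, assuming s3cmd is available and the objects were not multipart
uploads (so the S3 ETag is just the MD5 of the object):

  # Flag any local file whose MD5 differs from the S3-side MD5.
  for f in week.46*clickLog.2009*; do
      local_md5=$(md5sum "$f" | awk '{print $1}')
      # "s3cmd info" prints a line like:  MD5 sum:   <hex digest>
      remote_md5=$(s3cmd info "s3://bucket/$f" | awk '/MD5 sum:/ {print $3}')
      [ "$local_md5" = "$remote_md5" ] || echo "MISMATCH: $f"
  done

Any file it flags is worth re-downloading before comparing counts again.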

On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <[email protected]> wrote:

> Another thing to verify is that clickurl's position in the schema is
> correct.
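>
> A quick shell check for that (a minimal sketch, assuming tab-delimited
> input; replace FILE with any one of the clickLog files):
>
>   head -1 FILE | tr '\t' '\n' | cat -n
>
> Each field of the first record prints on its own numbered line, so
> clickurl should show up at position 8 if the schema above is right.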
> -Thejas
>
>
>
> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]>
> wrote:
>
> > > Hmm... Are you sure that your records are separated by \n (newline)
> > > and fields by \t (tab)? If so, would it be possible for you to upload
> > > your dataset (possibly a smaller sample) somewhere so that someone can
> > > take a look at it?
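> > >
> > > One way to verify that from the shell, as a minimal sketch assuming the
> > > files are local and each record should have exactly 11 tab-separated
> > > fields:
> > >
> > >   awk -F'\t' 'NF != 11 {bad++} END {print bad+0, "suspect lines"}' *week.46*clickLog.2009*
> > >
> > > A nonzero count points at stray tabs or newlines inside a field.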
> >
> > Ashutosh
> >
> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]>
> wrote:
> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
> >> [email protected]> wrote:
> >>
> >>> Hi Zaki,
> >>>
> >>> Just to narrow down the problem, can you do:
> >>>
> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> >>> dump A;
> >>>
> >>
> >> This produced 143710 records.
> >>
> >>
> >>> and
> >>>
> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> >>> timestamp:chararray,
> >>> ip:chararray,
> >>> userid:chararray,
> >>> dist:chararray,
> >>> clickid:chararray,
> >>> usra:chararray,
> >>> campaign:chararray,
> >>> clickurl:chararray,
> >>> plugin:chararray,
> >>> tab:chararray,
> >>> feature:chararray);
> >>> dump A;
> >>>
> >>
> >>
> >> This produced 143710 records (so no problem there).
> >>
> >>
> >>> and
> >>>
> >>> cut -f8 *week.46*clickLog.2009* | wc -l
> >>>
> >>
> >>
> >> This produced...
> >> 175572
> >>
> >> Clearly, something is wrong...
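> >>
> >> One thing to keep in mind: cut emits one output line per input line,
> >> even when a line has no field 8 (by default it passes lines without any
> >> delimiter through whole), so cut | wc -l counts physical lines, not
> >> records. A minimal sketch of the comparison, assuming a record should
> >> have 11 tab-separated fields:
> >>
> >>   cut -f8 *week.46*clickLog.2009* | wc -l               # every physical line
> >>   awk -F'\t' 'NF >= 8' *week.46*clickLog.2009* | wc -l  # lines with a real field 8
> >>
> >> If the second number is closer to 143710, the extra ~32k lines are
> >> fragments of broken records rather than data Pig is dropping.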
> >>
> >>
> >> Thanks,
> >>> Ashutosh
> >>>
> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]>
> >>> wrote:
> >>>> Hi All,
> >>>>
> >>>> I have the following mini-script running as part of a larger set of
> >>>> scripts/workflows... however, it seems like Pig is dropping records:
> >>>> when I tried running the same thing as a simple grep | wc -l, I got a
> >>>> completely different result (about 2500 with Pig vs. 3300). The Pig
> >>>> script is as follows:
> >>>>
> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> >>>> (timestamp:chararray,
> >>>> ip:chararray,
> >>>> userid:chararray,
> >>>> dist:chararray,
> >>>> clickid:chararray,
> >>>> usra:chararray,
> >>>> campaign:chararray,
> >>>> clickurl:chararray,
> >>>> plugin:chararray,
> >>>> tab:chararray,
> >>>> feature:chararray);
> >>>>
> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
> >>>>
> >>>> dump B produces the following output:
> >>>> 2009-11-19 18:50:46,013 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> >>>>
> >>>>
> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> >>>> http://www.amazon | wc -l
> >>>>
> >>>> Both sets of inputs are the same files... and I'm not sure where the
> >>>> discrepancy is coming from. Any help would be greatly appreciated.
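> >>>>
> >>>> A sanity check worth trying: a minimal sketch that applies the same
> >>>> filter per well-formed record instead of per physical line (assumes a
> >>>> record should have 11 tab-separated fields):
> >>>>
> >>>>   awk -F'\t' 'NF == 11 && index($8, "http://www.amazon")' *week.46*clickLog.2009* | wc -l
> >>>>
> >>>> If this lands near Pig's 2502 rather than grep's 3300, the extra grep
> >>>> hits are coming from malformed lines, not from records Pig dropped.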
> >>>>
> >>>> --
> >>>> Zaki Rahaman
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Zaki Rahaman
> >>
>
>


-- 
Zaki Rahaman
