On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <[email protected]> wrote:

> Hi Zaki,
>
> Just to narrow down the problem, can you do:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> dump A;
>

This produced 143710 records.


> and
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
> dump A;
>


This also produced 143710 records, so no problem there.


> and
>
> cut -f8 *week.46*clickLog.2009* | wc -l
>


This produced 175572 lines, 31862 more than the 143710 records Pig loaded.

Clearly, something is wrong...
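
To narrow down where the two counts diverge, one option is to profile the shell side per file: total lines, how many tab-separated fields each line has, and how many lines match in column 8. The sketch below is only an illustration. The function name `profile_lines` and its parameters are hypothetical, it assumes the logs are plain tab-delimited text in the current directory, and it imitates how `cut -f8` treats short lines. It does not explain Pig's count by itself, but the per-file totals can be compared against a per-file LOAD in Pig.

```python
import glob
from collections import Counter


def profile_lines(lines, column=8, needle="http://www.amazon"):
    """Return (total lines, Counter of fields per line, hits in `column`).

    Hypothetical helper: `column` is 1-based, like `cut -f`.
    """
    total = 0
    field_counts = Counter()
    hits = 0
    for line in lines:
        total += 1
        fields = line.rstrip("\n").split("\t")
        field_counts[len(fields)] += 1
        if len(fields) == 1:
            # Like `cut -f8` without -s: a line with no tab is passed through whole.
            value = fields[0]
        elif len(fields) >= column:
            value = fields[column - 1]
        else:
            # Line has tabs but fewer than `column` fields: cut emits an empty line.
            value = ""
        if needle in value:
            hits += 1
    return total, field_counts, hits


if __name__ == "__main__":
    # Same glob as the shell command, assuming local copies of the files.
    for path in sorted(glob.glob("*week.46*clickLog.2009*")):
        # newline="\n" splits only on "\n", which is what `wc -l` counts.
        with open(path, encoding="utf-8", errors="replace", newline="\n") as fh:
            total, field_counts, hits = profile_lines(fh)
        print(path, total, dict(field_counts), hits)
```

Unlike `wc -l`, this also counts a final line that lacks a trailing newline, so the totals can differ by at most one per file. Lines whose field count is not 11 would be the first ones to inspect.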


> Thanks,
> Ashutosh
>
> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
> > Hi All,
> >
> > I have the following mini-script running as part of a larger set of
> > scripts/workflow. However, it seems that Pig is dropping records: when I
> > ran the same thing as a simple grep | wc -l, I got a completely
> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
> >
> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> > (timestamp:chararray,
> > ip:chararray,
> > userid:chararray,
> > dist:chararray,
> > clickid:chararray,
> > usra:chararray,
> > campaign:chararray,
> > clickurl:chararray,
> > plugin:chararray,
> > tab:chararray,
> > feature:chararray);
> >
> > B = FILTER A BY clickurl matches '.*http://www.amazon.*';
> >
> > dump B produces the following output:
> > 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> >
> >
> > The bash command is simply:
> >
> > cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
> >
> > Both sets of inputs are the same files, and I'm not sure where the
> > discrepancy is coming from. Any help would be greatly appreciated.
> >
> > --
> > Zaki Rahaman
> >
>



-- 
Zaki Rahaman
