Hi Zaki,

Just to narrow down the problem, can you do:

A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
dump A;

and

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
timestamp:chararray,
ip:chararray,
userid:chararray,
dist:chararray,
clickid:chararray,
usra:chararray,
campaign:chararray,
clickurl:chararray,
plugin:chararray,
tab:chararray,
feature:chararray);
dump A;

and

cut -f8 *week.46*clickLog.2009* | wc -l
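
If the counts still disagree, one more check that might help is whether every line really has the same number of tab-separated fields. Note that cut passes a line containing no tab at all through unchanged (unless -s is given), so such a line can still be matched by fgrep even though it has no eighth field. The following is only a rough sketch under stated assumptions: the sample file, its contents, and the function name field_histogram are made up for illustration and are not taken from the real logs.

```shell
# Hypothetical sample standing in for the click logs (tab-separated, 11 fields expected).
tmp=$(mktemp -d)
{
  printf 'a\tb\tc\td\te\tf\tg\thttp://www.amazon.com/x\ti\tj\tk\n'  # complete row
  printf 'a\tb\tc\td\te\tf\tg\thttp://www.amazon.com/y\n'           # row cut short after field 8
  printf 'a\tb\tc\n'                                                # row with only 3 fields
  printf 'no tabs here http://www.amazon.com/z\n'                   # row with no tab at all
} > "$tmp/sample.log"

# Print "<field count> <number of lines>" for each distinct field count.
field_histogram() {
  awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$@" | sort -n
}

hist=$(field_histogram "$tmp/sample.log")
echo "$hist"

# Count the way the bash pipeline does: the tab-less line is passed through whole.
cut_count=$(cut -f8 "$tmp/sample.log" | grep -F -c 'http://www.amazon')

# Count only lines that really have an eighth field containing the string.
strict_count=$(awk -F'\t' 'NF >= 8 && index($8, "http://www.amazon") { c++ } END { print c+0 }' "$tmp/sample.log")

echo "cut | fgrep: $cut_count, strict field 8: $strict_count"
rm -rf "$tmp"
```

On the real data, running field_histogram over *week.46*clickLog.2009* would show whether any lines have a field count other than 11; lines like that are candidates for being counted differently by the two approaches.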

Thanks,
Ashutosh

On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
> Hi All,
>
> I have the following mini-script running as part of a larger set of
> scripts/workflow. However, Pig seems to be dropping records: when I ran
> the same thing as a simple grep | wc -l, I got a completely different
> result (2500 with Pig vs. 3300 with grep). The Pig script is as follows:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> (timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
>
> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>
> dump B produces the following output:
> 2009-11-19 18:50:46,013 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Records written : 2502
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Bytes written : 0
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!
>
>
> The bash command is simply:
>
> cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>
> Both sets of inputs are the same files... and I'm not sure where the
> discrepancy is coming from. Any help would be greatly appreciated.
>
> --
> Zaki Rahaman
>
