Hi All,

I have the following mini-script running as part of a larger set of
scripts/workflow. However, Pig seems to be dropping records: when I ran the
equivalent grep | wc -l on the same data I got a completely different count
(2,500 with Pig vs. 3,300 with grep). The Pig script is as follows:

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
(timestamp:chararray,
ip:chararray,
userid:chararray,
dist:chararray,
clickid:chararray,
usra:chararray,
campaign:chararray,
clickurl:chararray,
plugin:chararray,
tab:chararray,
feature:chararray);

B = FILTER A BY clickurl matches '.*http://www.amazon.*';

DUMP B produces the following output:
2009-11-19 18:50:46,013 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Records written : 2502
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Bytes written : 0
2009-11-19 18:50:46,058 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!


The bash command is simply:

cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
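In case it helps with diagnosis, here is a toy Python sketch of one way the two counts could diverge (the sample rows are made up, not real log data): Pig's `matches` is a whole-field Java-regex match and filters out records whose clickurl is null, while cut -f8 passes a line containing no tab through unchanged, so fgrep can still count malformed lines.

```python
import re

# Toy reproduction of the two counting methods. The rows below are
# invented examples, not real clickLog data.
PATTERN = re.compile(r'.*http://www.amazon.*')

def pig_clickurl(line):
    # Pig's LOAD splits on tabs; clickurl is the 8th field, null if missing.
    parts = line.split('\t')
    return parts[7] if len(parts) > 7 else None

def pig_filter(line):
    # Pig's `matches` follows Java String.matches(): the WHOLE field must
    # match the pattern, and a null field never matches.
    url = pig_clickurl(line)
    return url is not None and PATTERN.fullmatch(url) is not None

def cut_f8(line):
    # cut -f8 prints the 8th tab-separated field, but (without -s) a line
    # containing no delimiter at all is passed through unchanged.
    if '\t' not in line:
        return line
    parts = line.split('\t')
    return parts[7] if len(parts) > 7 else ''

rows = [
    # Well-formed row: both methods count it.
    "t\tip\tuid\tdist\tcid\tusra\tcamp\thttp://www.amazon.com/item\tplug\ttab\tfeat",
    # Corrupt row with no tabs: cut passes the whole line to fgrep (counted),
    # but Pig sees a single field, clickurl is null, and the record is dropped.
    "corrupt line mentioning http://www.amazon.com with no tabs",
]

grep_count = sum(1 for line in rows if 'http://www.amazon' in cut_f8(line))
pig_count = sum(1 for line in rows if pig_filter(line))
print(grep_count, pig_count)  # the two methods disagree on the corrupt row
```

So malformed or short records could account for at least part of the gap.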

Both commands read the same set of files, and I'm not sure where the
discrepancy is coming from. Any help would be greatly appreciated.

-- 
Zaki Rahaman
