Hi All, I have the following mini-script running as part of a larger set of scripts/workflow, but Pig seems to be dropping records: when I run the equivalent grep | wc -l over the same input, I get a completely different count (2500 with Pig vs. 3300 with grep). The Pig script is as follows:
    A = LOAD 's3n://bucket/*week.46*clickLog.2009*'
        AS (timestamp:chararray, ip:chararray, userid:chararray,
            dist:chararray, clickid:chararray, usra:chararray,
            campaign:chararray, clickurl:chararray, plugin:chararray,
            tab:chararray, feature:chararray);
    B = FILTER A BY clickurl matches '.*http://www.amazon.*';

dump B; produces the following output:

    2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
    2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
    2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
    2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

The bash command is simply:

    cut -f8 *week.46*clickLog.2009* | fgrep 'http://www.amazon' | wc -l

Both sets of inputs are the same files, and I'm not sure where the discrepancy is coming from. Any help would be greatly appreciated.

--
Zaki Rahaman
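One possible source of a gap like this is rows that don't have the full 11 tab-separated columns: Pig's LOAD applies the schema per field, while cut -f8 | fgrep simply counts any line whose 8th field contains the substring. Here is a rough sanity-check sketch for that theory; the audit() helper and the sample rows are made up for illustration, not taken from the real logs:

```python
def audit(lines, needle="http://www.amazon", field=7, ncols=11):
    """Count lines whose 8th tab-separated field contains `needle`,
    and separately count lines that don't have exactly `ncols` columns
    (candidates for records Pig may treat differently)."""
    matches = bad_rows = 0
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != ncols:
            bad_rows += 1
        if len(cols) > field and needle in cols[field]:
            matches += 1
    return matches, bad_rows

# Made-up sample rows: two matching click URLs, one of them in a
# truncated 8-column record, and one well-formed non-matching record.
sample = [
    "\t".join(["x"] * 7 + ["http://www.amazon.com/dp/1"] + ["x"] * 3),
    "\t".join(["x"] * 7 + ["http://www.amazon.com/dp/2"]),
    "\t".join(["x"] * 11),
]
print(audit(sample))  # -> (2, 1): 2 matches, 1 malformed row
```

Running something like this over the raw files and comparing bad_rows against the 800-record gap would show whether malformed records account for the difference. Note also that fgrep does a literal substring match, while Pig's matches is a full-string regex, so the two filters are not strictly equivalent either.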
