Hi Zaki,

Just to narrow down the problem, can you run the following three checks?

A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
dump A;

and

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
    timestamp:chararray,
    ip:chararray,
    userid:chararray,
    dist:chararray,
    clickid:chararray,
    usra:chararray,
    campaign:chararray,
    clickurl:chararray,
    plugin:chararray,
    tab:chararray,
    feature:chararray);
dump A;

and

cut -f8 *week.46*clickLog.2009* | wc -l

Thanks,
Ashutosh

On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
> Hi All,
>
> I have the following mini-script running as part of a larger set of
> scripts/workflow... however it seems like pig is dropping records as when I
> tried running the same thing as a simple grep | wc -l I get a completely
> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> (timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
>
> B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>
> dump B produces the following output:
> 2009-11-19 18:50:46,013 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Records written : 2502
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Bytes written : 0
> 2009-11-19 18:50:46,058 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!
>
>
> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
> http://www.amazon | wc -l
>
> Both sets of inputs are the same files... and I'm not sure where the
> discrepancy is coming from. Any help would be greatly appreciated.
>
> --
> Zaki Rahaman
>
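
One further check that might help with a count mismatch like the one above is a histogram of how many tab-separated fields each line has, since lines without exactly 11 fields could be treated differently by cut and by the Pig schema. This is only a rough sketch: the helper name field_histogram is hypothetical, and it assumes the logs are tab-delimited and available as local files, as in the cut command.

```shell
# Hypothetical helper (not part of Pig or the original thread):
# prints "<field count> <number of lines>" for tab-delimited input files.
field_histogram() {
  awk -F'\t' '
    { counts[NF]++ }                                 # tally lines by field count
    END { for (n in counts) print n, counts[n] }     # one row per distinct count
  ' "$@" | sort -n                                   # order rows by field count
}
```

It could be run as field_histogram *week.46*clickLog.2009* and, if every line is well formed, it should print a single row starting with 11.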
