On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan < [email protected]> wrote:
> Hi Zaki,
>
> Just to narrow down the problem, can you do:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> dump A;
>

This produced 143710 records;

> and
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
> timestamp:chararray,
> ip:chararray,
> userid:chararray,
> dist:chararray,
> clickid:chararray,
> usra:chararray,
> campaign:chararray,
> clickurl:chararray,
> plugin:chararray,
> tab:chararray,
> feature:chararray);
> dump A;
>

This produced 143710 records (so no problem there);

> and
>
> cut -f8 *week.46*clickLog.2009* | wc -l
>

This produced... 175572

Clearly, something is wrong...

Thanks,
> Ashutosh
>
> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
> > Hi All,
> >
> > I have the following mini-script running as part of a larger set of
> > scripts/workflow... however it seems like pig is dropping records as when I
> > tried running the same thing as a simple grep | wc -l I get a completely
> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
> >
> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
> > (timestamp:chararray,
> > ip:chararray,
> > userid:chararray,
> > dist:chararray,
> > clickid:chararray,
> > usra:chararray,
> > campaign:chararray,
> > clickurl:chararray,
> > plugin:chararray,
> > tab:chararray,
> > feature:chararray);
> >
> > B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
> >
> > dump B produces the following output:
> > 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> >
> >
> > The bash command is simply
> > cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
> >
> > Both sets of inputs are the same files... and I'm not sure where the
> > discrepancy is coming from. Any help would be greatly appreciated.
> >
> > --
> > Zaki Rahaman
> >
>

--
Zaki Rahaman
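
One way to narrow the 143710 vs. 175572 gap further might be to compare counts file by file rather than across the whole glob. The following is only a minimal sketch, assuming the log files are available locally (as the cut command above implies). The function name count_lines_per_file is hypothetical and not something from this thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print "<lines><TAB><file>" for
# each input file, then a grand total, so that per-file line counts can be
# compared against per-file record counts from Pig.
count_lines_per_file() {
    total=0
    for f in "$@"; do
        [ -f "$f" ] || continue            # skip a glob pattern that matched nothing
        n=$(wc -l < "$f" | tr -d ' ')      # some wc implementations pad with spaces
        printf '%s\t%s\n' "$n" "$f"
        total=$((total + n))
    done
    printf '%s\ttotal\n' "$total"
}

# Usage, with the glob from the thread:
#   count_lines_per_file *week.46*clickLog.2009*
```

If the total here matches the 175572 reported by wc -l, loading the files one at a time in Pig and comparing each count against this listing should show which files account for the missing records.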
