Hmm... Are you sure that your records are separated by \n (newline) and your fields by \t (tab)? If so, would it be possible for you to upload your dataset (possibly a smaller sample) somewhere so that someone can take a look at it?
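One quick way to check the separators yourself, as a rough sketch: it assumes the log files are available locally (as in your cut command), and the helper names field_counts and cr_lines are made up for illustration. The first prints how many lines have each number of tab-separated fields (your schema declares 11), and the second counts lines that contain a carriage return, which would point to \r\n line endings.

```shell
# Hypothetical helper: for each distinct number of tab-separated fields,
# print "<field count> <number of lines>", sorted by field count.
field_counts() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$@" | sort -n
}

# Hypothetical helper: count lines that contain a carriage return (\r).
cr_lines() {
    awk '/\r/ { c++ } END { print c + 0 }' "$@"
}

# Usage against the same files as the cut command:
#   field_counts *week.46*clickLog.2009*
#   cr_lines *week.46*clickLog.2009*
```

If most lines do not show 11 fields, or many lines contain \r, the files are probably not plain tab/newline-delimited, and that would be the first thing to look at.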
Ashutosh

On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <[email protected]> wrote:
>
>> Hi Zaki,
>>
>> Just to narrow down the problem, can you do:
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> dump A;
>>
>
> This produced 143710 records;
>
>> and
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> timestamp:chararray,
>> ip:chararray,
>> userid:chararray,
>> dist:chararray,
>> clickid:chararray,
>> usra:chararray,
>> campaign:chararray,
>> clickurl:chararray,
>> plugin:chararray,
>> tab:chararray,
>> feature:chararray);
>> dump A;
>>
>
> This produced 143710 records (so no problem there);
>
>> and
>>
>> cut -f8 *week.46*clickLog.2009* | wc -l
>>
>
> This produced...
> 175572
>
> Clearly, something is wrong...
>
>> Thanks,
>> Ashutosh
>>
>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
>> > Hi All,
>> >
>> > I have the following mini-script running as part of a larger set of
>> > scripts/workflow... however it seems like pig is dropping records as when I
>> > tried running the same thing as a simple grep | wc -l I get a completely
>> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
>> >
>> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> > (timestamp:chararray,
>> > ip:chararray,
>> > userid:chararray,
>> > dist:chararray,
>> > clickid:chararray,
>> > usra:chararray,
>> > campaign:chararray,
>> > clickurl:chararray,
>> > plugin:chararray,
>> > tab:chararray,
>> > feature:chararray);
>> >
>> > B = FILTER raw BY clickurl matches '.*http://www.amazon.*';
>> >
>> > dump B produces the following output:
>> > 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>> >
>> > The bash command is simply
>> > cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>> >
>> > Both sets of inputs are the same files... and I'm not sure where the
>> > discrepency is coming from. Any help would be greatly appreciated.
>> >
>> > --
>> > Zaki Rahaman
>> >
>
> --
> Zaki Rahaman
>
