Another thing to verify is that clickurl's position in the schema is correct.

-Thejas
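A minimal sketch of how that check could be done from the shell, assuming the logs are plain tab-separated text as described in the thread. The helper names `field_histogram` and `count_matches_in_field` are hypothetical and used for illustration only. The first tallies how many tab-separated fields each line has (the schema in the thread declares 11); the second counts lines whose Nth field contains a fixed string, much like the `cut -f8 | fgrep | wc -l` pipeline.

```shell
#!/bin/sh
# Hypothetical helpers; they read the files given as arguments,
# or stdin when no files are given.

# Print "<field count><TAB><number of lines>" for each distinct
# number of tab-separated fields found in the input.
field_histogram() {
  awk -F'\t' '{ count[NF]++ }
              END { for (n in count) print n "\t" count[n] }' "$@" | sort -n
}

# Count lines whose field number $1 contains the fixed string $2.
# Note: awk -v processes backslash escapes, so avoid backslashes in $2.
count_matches_in_field() {
  field="$1"; needle="$2"; shift 2
  awk -F'\t' -v f="$field" -v s="$needle" \
      'index($f, s) > 0 { n++ } END { print n + 0 }' "$@"
}
```

For example, `field_histogram *week.46*clickLog.2009*` should report a single row for 11 fields if every line matches the declared schema, and `count_matches_in_field 8 'http://www.amazon' *week.46*clickLog.2009*` can be repeated with 7 or 9 to see whether the URLs actually sit in a neighbouring column.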
On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]> wrote:

> Hmm... Are you sure that your records are separated by \n (newline)
> and fields by \t (tab)? If so, would it be possible for you to upload
> your dataset (possibly a smaller one) somewhere so that someone can
> take a look at it?
>
> Ashutosh
>
> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <[email protected]> wrote:
>>
>>> Hi Zaki,
>>>
>>> Just to narrow down the problem, can you do:
>>>
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>> dump A;
>>
>> This produced 143710 records;
>>
>>> and
>>>
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>     timestamp:chararray,
>>>     ip:chararray,
>>>     userid:chararray,
>>>     dist:chararray,
>>>     clickid:chararray,
>>>     usra:chararray,
>>>     campaign:chararray,
>>>     clickurl:chararray,
>>>     plugin:chararray,
>>>     tab:chararray,
>>>     feature:chararray);
>>> dump A;
>>
>> This produced 143710 records (so no problem there);
>>
>>> and
>>>
>>> cut -f8 *week.46*clickLog.2009* | wc -l
>>
>> This produced...
>> 175572
>>
>> Clearly, something is wrong...
>>
>>> Thanks,
>>> Ashutosh
>>>
>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
>>>> Hi All,
>>>>
>>>> I have the following mini-script running as part of a larger set of
>>>> scripts/workflow. However, it seems like Pig is dropping records:
>>>> when I ran the same thing as a simple grep | wc -l, I got a
>>>> completely different result (2500 with Pig vs. 3300).
>>>>
>>>> The Pig script is as follows:
>>>>
>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>>>>     (timestamp:chararray,
>>>>     ip:chararray,
>>>>     userid:chararray,
>>>>     dist:chararray,
>>>>     clickid:chararray,
>>>>     usra:chararray,
>>>>     campaign:chararray,
>>>>     clickurl:chararray,
>>>>     plugin:chararray,
>>>>     tab:chararray,
>>>>     feature:chararray);
>>>>
>>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>>>>
>>>> dump B produces the following output:
>>>>
>>>> 2009-11-19 18:50:46,013 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>> - Records written : 2502
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>> - Bytes written : 0
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>> - Success!
>>>>
>>>> The bash command is simply:
>>>>
>>>> cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>>>>
>>>> Both sets of inputs are the same files, and I'm not sure where the
>>>> discrepancy is coming from. Any help would be greatly appreciated.
>>>>
>>>> --
>>>> Zaki Rahaman
>>
>> --
>> Zaki Rahaman
