Zaki,

Glad to hear it wasn't Pig's fault! Can you post a description of what was
going on with S3, or at least how you fixed it?
-D

On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <[email protected]> wrote:
> Okay, fixed a problem with corrupted file transfers from S3... now wc -l
> produces the same 143710 records, so it's not a Pig problem, and now I am
> getting the correct result from both methods. Not sure what went wrong...
> thanks for the help though, guys.
>
> On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <[email protected]> wrote:
>
>> Another thing to verify is that clickurl's position in the schema is
>> correct.
>> -Thejas
>>
>> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>>
>>> Hmm... Are you sure that your records are separated by \n (newline)
>>> and fields by \t (tab)? If so, would it be possible for you to upload
>>> your dataset (possibly a smaller one) somewhere so that someone can
>>> take a look at it?
>>>
>>> Ashutosh
>>>
>>> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
>>>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Zaki,
>>>>>
>>>>> Just to narrow down the problem, can you do:
>>>>>
>>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>>>> dump A;
>>>>
>>>> This produced 143710 records.
>>>>
>>>>> and
>>>>>
>>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>>>     timestamp:chararray,
>>>>>     ip:chararray,
>>>>>     userid:chararray,
>>>>>     dist:chararray,
>>>>>     clickid:chararray,
>>>>>     usra:chararray,
>>>>>     campaign:chararray,
>>>>>     clickurl:chararray,
>>>>>     plugin:chararray,
>>>>>     tab:chararray,
>>>>>     feature:chararray);
>>>>> dump A;
>>>>
>>>> This also produced 143710 records (so no problem there).
>>>>
>>>>> and
>>>>>
>>>>> cut -f8 *week.46*clickLog.2009* | wc -l
>>>>
>>>> This produced 175572.
>>>>
>>>> Clearly, something is wrong...
>>>>
>>>>> Thanks,
>>>>> Ashutosh
>>>>>
>>>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I have the following mini-script running as part of a larger set of
>>>>>> scripts/workflows... however, it seems like Pig is dropping records:
>>>>>> when I tried running the same thing as a simple grep | wc -l, I got
>>>>>> a completely different result (2500 with Pig vs. 3300).
>>>>>> The Pig script is as follows:
>>>>>>
>>>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>>>>     timestamp:chararray,
>>>>>>     ip:chararray,
>>>>>>     userid:chararray,
>>>>>>     dist:chararray,
>>>>>>     clickid:chararray,
>>>>>>     usra:chararray,
>>>>>>     campaign:chararray,
>>>>>>     clickurl:chararray,
>>>>>>     plugin:chararray,
>>>>>>     tab:chararray,
>>>>>>     feature:chararray);
>>>>>>
>>>>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>>>>>>
>>>>>> dump B produces the following output:
>>>>>>
>>>>>> 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>>>>
>>>>>> The bash command is simply:
>>>>>>
>>>>>> cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>>>>>>
>>>>>> Both sets of inputs are the same files... and I'm not sure where the
>>>>>> discrepancy is coming from. Any help would be greatly appreciated.
>>>>>>
>>>>>> --
>>>>>> Zaki Rahaman
>>>>
>>>> --
>>>> Zaki Rahaman
>
> --
> Zaki Rahaman
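
A few follow-up notes for anyone who hits the same symptoms. For Ashutosh's
separator question, a quick check is to count tab-separated fields per
record; with clean data every line should report the same count (11 for the
schema in this thread). A minimal sketch, assuming awk and the same file
glob used above:

    # Histogram of field counts per record; a single row reading
    # "143710 11" would mean the \t / \n assumption holds everywhere.
    awk -F'\t' '{ print NF }' *week.46*clickLog.2009* | sort | uniq -c

    # List any malformed records (wrong field count) for inspection.
    awk -F'\t' 'NF != 11 { print FILENAME ": " FNR ": " NF " fields" }' \
        *week.46*clickLog.2009* | head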
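
Thejas's suggestion, verifying clickurl's position, can be spot-checked
from the shell the same way the fgrep pipeline was built: project the
column and eyeball it. If field 8 really is clickurl, this should print
URLs rather than timestamps or IPs:

    # Field numbering matches the AS (...) schema above: clickurl is the
    # 8th declared field, hence cut -f8.
    cut -f8 *week.46*clickLog.2009* | head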
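
As for the root cause (corrupted transfers from S3), one way to catch a bad
download is to compare a local checksum against the object's ETag: for
objects uploaded in a single part, the ETag is the MD5 of the content. A
sketch using the modern AWS CLI (not the tooling available in 2009, and
"bucket" is the placeholder name from this thread):

    # Compare each local file's MD5 against its S3 ETag; a mismatch
    # suggests a truncated or corrupted transfer. This only holds for
    # single-part uploads: multipart ETags are an MD5-of-MD5s with a
    # "-<parts>" suffix, so a mismatch there is not conclusive.
    for f in *week.46*clickLog.2009*; do
      etag=$(aws s3api head-object --bucket bucket --key "$f" \
                 --query ETag --output text | tr -d '"')
      [ "$etag" = "$(md5sum "$f" | cut -d' ' -f1)" ] || echo "suspect: $f"
    done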
