Okay, fixed a problem with corrupted file transfers from S3... now wc -l produces the same 143710 records, so it's not a Pig problem after all, and I'm getting the correct result from both methods. Not sure what went wrong, but thanks for the help, guys.
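In case it helps anyone who hits the same thing later: rather than counting dump output by hand, you can have Pig count the records itself and compare that number directly against wc -l. A minimal sketch against the same load path as below (the grpd/total alias names are mine; on recent Pig versions COUNT skips records whose first field is null, so use COUNT_STAR if that matters for your data):

A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
grpd = GROUP A ALL;                      -- one group holding every record
total = FOREACH grpd GENERATE COUNT(A);  -- record count, comparable to wc -l
DUMP total;                              -- on the fixed files this should print (143710)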
On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <[email protected]> wrote:
> Another thing to verify is that clickurl's position in the schema is
> correct.
> -Thejas
>
> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>
>> Hmm... Are you sure that your records are separated by \n (newline)
>> and fields by \t (tab)? If so, would it be possible for you to upload
>> your dataset (possibly a smaller one) somewhere so that someone can
>> take a look at it?
>>
>> Ashutosh
>>
>> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
>>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan
>>> <[email protected]> wrote:
>>>
>>>> Hi Zaki,
>>>>
>>>> Just to narrow down the problem, can you do:
>>>>
>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>>> dump A;
>>>
>>> This produced 143710 records.
>>>
>>>> and
>>>>
>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>>     timestamp:chararray,
>>>>     ip:chararray,
>>>>     userid:chararray,
>>>>     dist:chararray,
>>>>     clickid:chararray,
>>>>     usra:chararray,
>>>>     campaign:chararray,
>>>>     clickurl:chararray,
>>>>     plugin:chararray,
>>>>     tab:chararray,
>>>>     feature:chararray);
>>>> dump A;
>>>
>>> This produced 143710 records (so no problem there).
>>>
>>>> and
>>>>
>>>> cut -f8 *week.46*clickLog.2009* | wc -l
>>>
>>> This produced...
>>> 175572
>>>
>>> Clearly, something is wrong...
>>>
>>>> Thanks,
>>>> Ashutosh
>>>>
>>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]> wrote:
>>>>> Hi All,
>>>>>
>>>>> I have the following mini-script running as part of a larger set of
>>>>> scripts/workflow... however, it seems like Pig is dropping records:
>>>>> when I ran the same thing as a simple grep | wc -l, I got a
>>>>> completely different result (2500 with Pig vs. 3300). The Pig script
>>>>> is as follows:
>>>>>
>>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>>>     timestamp:chararray,
>>>>>     ip:chararray,
>>>>>     userid:chararray,
>>>>>     dist:chararray,
>>>>>     clickid:chararray,
>>>>>     usra:chararray,
>>>>>     campaign:chararray,
>>>>>     clickurl:chararray,
>>>>>     plugin:chararray,
>>>>>     tab:chararray,
>>>>>     feature:chararray);
>>>>>
>>>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>>>>>
>>>>> dump B produces the following output:
>>>>>
>>>>> 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>>>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>>>
>>>>> The bash command is simply:
>>>>> cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>>>>>
>>>>> Both sets of inputs are the same files... and I'm not sure where the
>>>>> discrepancy is coming from. Any help would be greatly appreciated.
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>
>>> --
>>> Zaki Rahaman

--
Zaki Rahaman
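P.S. One footnote on the FILTER above for the archives: Pig's matches uses Java regular expressions and has to match the entire field, which is why the leading and trailing .* are needed. Also, the unescaped dots in www.amazon match any character, not just a literal dot. A stricter pattern (a sketch, not tested against this dataset; same aliases as the script above) would be:

-- '\\.' in a Pig string literal reaches the regex engine as \. (a literal dot)
B = FILTER A BY clickurl matches '.*http://www\\.amazon.*';

This also keeps the Pig count closer to the fgrep check, since fgrep matches its pattern as a literal string.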
