Sam, can you post your changes to a Jira?

-D
On Fri, Nov 20, 2009 at 1:28 PM, Sam Rash <s...@ning.com> wrote:

> Hi,
>
> This reminds me of something else, though: I took the latest patch for
> PIG-911 (sequence file reader) and found it skipped records.
>
> https://issues.apache.org/jira/browse/PIG-911
>
> What I found is that the condition in getNext() would miss records:
>
>   if (reader != null && (reader.getPosition() < end || !reader.syncSeen())
>       && reader.next(key, value)) {
>     ...
>   }
>
> I had to change it to:
>
>   if (reader != null && reader.next(key, value)
>       && (reader.getPosition() < end || !reader.syncSeen())) {
>     ...
>   }
>
> (I also ended up breaking it out into read(key) and the check below, to
> support reading types other than Writable.)
>
> This only happened when the files Pig read were more than one block; i.e.,
> the dropped records were around block boundaries.
>
> Has anyone noticed this?
>
> thx,
> -sr
>
> Sam Rash
> s...@ning.com
>
>
> On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:
>
>> Zaki,
>> Glad to hear it wasn't Pig's fault!
>> Can you post a description of what was going on with S3, or at least
>> how you fixed it?
>>
>> -D
>>
>> On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <zaki.raha...@gmail.com> wrote:
>>
>>> Okay, fixed; it was a problem with corrupted file transfers from S3.
>>> Now wc -l produces the same 143710 records, so it's not a Pig problem,
>>> and I am getting the correct result from both methods. Not sure what
>>> went wrong... thanks for the help though, guys.
>>>
>>> On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>>>
>>>> Another thing to verify is that clickurl's position in the schema is
>>>> correct.
>>>> -Thejas
>>>>
>>>> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>>>
>>>>> Hmm... Are you sure that your records are separated by \n (newline)
>>>>> and fields by \t (tab)?
>>>>> If so, would it be possible for you to upload your dataset (possibly
>>>>> a smaller one) somewhere so that someone can take a look at it?
>>>>>
>>>>> Ashutosh
>>>>>
>>>>> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <zaki.raha...@gmail.com> wrote:
>>>>>
>>>>>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <ashutosh.chau...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Zaki,
>>>>>>>
>>>>>>> Just to narrow down the problem, can you do:
>>>>>>>
>>>>>>>   A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>>>>>>   dump A;
>>>>>>
>>>>>> This produced 143710 records;
>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>   A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>>>>>>     timestamp:chararray,
>>>>>>>     ip:chararray,
>>>>>>>     userid:chararray,
>>>>>>>     dist:chararray,
>>>>>>>     clickid:chararray,
>>>>>>>     usra:chararray,
>>>>>>>     campaign:chararray,
>>>>>>>     clickurl:chararray,
>>>>>>>     plugin:chararray,
>>>>>>>     tab:chararray,
>>>>>>>     feature:chararray);
>>>>>>>   dump A;
>>>>>>
>>>>>> This produced 143710 records (so no problem there);
>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>   cut -f8 *week.46*clickLog.2009* | wc -l
>>>>>>
>>>>>> This produced...
>>>>>> 175572
>>>>>>
>>>>>> Clearly, something is wrong...
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ashutosh
>>>>>>>
>>>>>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <zaki.raha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have the following mini-script running as part of a larger set of
>>>>>>>> scripts/workflow; however, it seems like Pig is dropping records,
>>>>>>>> as when I tried running the same thing as a simple grep | wc -l I
>>>>>>>> get a completely different result (2500 with Pig vs. 3300).
>>>>>>>> The Pig script is as follows:
>>>>>>>>
>>>>>>>>   A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>>>>>>>>     (timestamp:chararray,
>>>>>>>>     ip:chararray,
>>>>>>>>     userid:chararray,
>>>>>>>>     dist:chararray,
>>>>>>>>     clickid:chararray,
>>>>>>>>     usra:chararray,
>>>>>>>>     campaign:chararray,
>>>>>>>>     clickurl:chararray,
>>>>>>>>     plugin:chararray,
>>>>>>>>     tab:chararray,
>>>>>>>>     feature:chararray);
>>>>>>>>
>>>>>>>>   B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>>>>>>>>
>>>>>>>> dump B produces the following output:
>>>>>>>>
>>>>>>>>   2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>>>>>>     - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>>>>>>   2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>>>>>>     - Records written : 2502
>>>>>>>>   2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>>>>>>     - Bytes written : 0
>>>>>>>>   2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>>>>>>>     - Success!
>>>>>>>>
>>>>>>>> The bash command is simply:
>>>>>>>>
>>>>>>>>   cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l
>>>>>>>>
>>>>>>>> Both sets of inputs are the same files, and I'm not sure where the
>>>>>>>> discrepancy is coming from. Any help would be greatly appreciated.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Zaki Rahaman
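A toy model helps show why the operand order in Sam's getNext() condition matters. Because `&&` short-circuits left to right, the original order tests getPosition()/syncSeen() *before* calling next(), so the boundary check judges the state left over from the previous record rather than the record about to be read. The FakeReader below is invented purely for illustration (its record/sync layout is hypothetical, and it is not the real org.apache.hadoop.io.SequenceFile.Reader API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {
    /** Minimal stand-in for a sequence-file reader's observable state. */
    static class FakeReader {
        final long[] recordEnd;      // file position after reading record i
        final boolean[] passedSync;  // did reading record i cross a sync mark?
        int i = -1;
        long pos = 0;
        boolean syncSeen = false;

        FakeReader(long[] recordEnd, boolean[] passedSync) {
            this.recordEnd = recordEnd;
            this.passedSync = passedSync;
        }
        boolean next() {
            if (i + 1 >= recordEnd.length) return false;
            i++;
            pos = recordEnd[i];
            syncSeen = passedSync[i];
            return true;
        }
        long getPosition() { return pos; }
        boolean syncSeen() { return syncSeen; }
    }

    // Original order: the boundary check runs BEFORE next(), so it sees
    // position/sync state from the PREVIOUS record.
    static List<Integer> oldOrder(FakeReader r, long end) {
        List<Integer> out = new ArrayList<>();
        while ((r.getPosition() < end || !r.syncSeen()) && r.next()) {
            out.add(r.i);
        }
        return out;
    }

    // Sam's fixed order: read first, then judge the record just read.
    static List<Integer> newOrder(FakeReader r, long end) {
        List<Integer> out = new ArrayList<>();
        while (r.next() && (r.getPosition() < end || !r.syncSeen())) {
            out.add(r.i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three records ending at positions 10, 20, 30; reading the third
        // crosses a sync mark. Split end = 15.
        long[] ends = {10, 20, 30};
        boolean[] sync = {false, false, true};

        System.out.println("old order emits: " + oldOrder(new FakeReader(ends, sync), 15));
        System.out.println("new order emits: " + newOrder(new FakeReader(ends, sync), 15));
    }
}
```

In this configuration the stale check makes the old order emit record 2 as well, even though it sits past a sync beyond the split end; with other sync placements the old order instead stops without calling next() at all once the previous record ends past the boundary with a sync seen, i.e., it quits one record early, which matches the dropped records Sam saw at block boundaries.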