Hi,

This reminds me of something else, though: I took the latest patch for PIG-911 (sequence file reader) and found that it skipped records.


https://issues.apache.org/jira/browse/PIG-911

What I found is that the condition in getNext() would miss records:

if (reader != null && (reader.getPosition() < end || ! reader.syncSeen()) && reader.next(key, value)) {
...
}

I had to change it to:

if (reader != null && reader.next(key,value) && (reader.getPosition() < end || !reader.syncSeen())) {
...
}

(I also ended up breaking this out to read(key) and get the below, to support reading types other than Writable.)

This only happened when the files Pig read were more than one block; i.e., the records dropped were around block boundaries.
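
To make the difference between the two conditions concrete, here is a minimal sketch of the evaluation order only. It assumes a hypothetical ToyReader (made-up offsets and sync flags, not Hadoop's SequenceFile.Reader and not the PIG-911 loader) and just records which position/syncSeen state the guard sees in each ordering. It says nothing about which ordering is right for real files.

```java
import java.util.ArrayList;
import java.util.List;

class GuardOrderDemo {
    /** Hypothetical toy reader; it is NOT Hadoop's SequenceFile.Reader. */
    static class ToyReader {
        private final long[] endOffsets;   // assumed position after each record
        private final boolean[] afterSync; // assumed: true if a sync mark precedes the record
        private int index = 0;
        private boolean syncSeen = false;

        ToyReader(long[] endOffsets, boolean[] afterSync) {
            this.endOffsets = endOffsets;
            this.afterSync = afterSync;
        }

        long getPosition() { return index == 0 ? 0 : endOffsets[index - 1]; }

        boolean syncSeen() { return syncSeen; }

        boolean next() {
            if (index == endOffsets.length) return false; // no more records
            syncSeen = afterSync[index];                  // reflects the record just read
            index++;
            return true;
        }
    }

    /** The guard from both conditions; logs the state it observes as "position/syncSeen". */
    static boolean guard(ToyReader r, long end, List<String> seen) {
        seen.add(r.getPosition() + "/" + r.syncSeen());
        return r.getPosition() < end || !r.syncSeen();
    }

    /** Guard first: the guard sees the state left by the previous read. */
    static List<String> guardThenRead(ToyReader r, long end) {
        List<String> seen = new ArrayList<>();
        while (guard(r, end, seen) && r.next()) { }
        return seen;
    }

    /** Read first: the guard sees the state left by the record just read. */
    static List<String> readThenGuard(ToyReader r, long end) {
        List<String> seen = new ArrayList<>();
        while (r.next() && guard(r, end, seen)) { }
        return seen;
    }

    public static void main(String[] args) {
        long[] ends = {10, 20, 30};            // made-up record end offsets
        boolean[] sync = {false, false, true}; // made-up sync flags
        System.out.println(guardThenRead(new ToyReader(ends, sync), 25));
        // prints [0/false, 10/false, 20/false, 30/true]
        System.out.println(readThenGuard(new ToyReader(ends, sync), 25));
        // prints [10/false, 20/false, 30/true]
    }
}
```

With the guard first, the check before each read uses the position and sync flag left by the previous record (or the initial state); with the read first, the same expression is evaluated against the record that was just consumed.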

Has anyone else noticed this?

thx,
-sr

Sam Rash
s...@ning.com



On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:

Zaki,
Glad to hear it wasn't Pig's fault!
Can you post a description of what was going on with S3, or at least
how you fixed it?

-D

On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <zaki.raha...@gmail.com> wrote:
> Okay fixed some problem with corrupted file transfers from S3... now wc -l
> produces the same 143710 records... so yea its not a problem... and now I am
> getting the correct result from both methods... not sure what went wrong...
> thanks for the help though guys.
>
> On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> Another thing to verify is that clickurl's position in the schema is
>> correct.
>> -Thejas
>>
>>
>>
>> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>
>> > Hmm... Are you sure that your records are separated by \n (newline)
>> > and fields by \t (tab)? If so, would it be possible for you to upload your
>> > dataset (possibly smaller) somewhere so that someone can take a look
>> > at that?
>> >
>> > Ashutosh
>> >
>> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <zaki.raha...@gmail.com> wrote:
>> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <ashutosh.chau...@gmail.com> wrote:
>> >>
>> >>> Hi Zaki,
>> >>>
>> >>> Just to narrow down the problem, can you do:
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> >>> dump A;
>> >>>
>> >>
>> >> This produced 143710 records;
>> >>
>> >>
>> >>> and
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> >>> timestamp:chararray,
>> >>> ip:chararray,
>> >>> userid:chararray,
>> >>> dist:chararray,
>> >>> clickid:chararray,
>> >>> usra:chararray,
>> >>> campaign:chararray,
>> >>> clickurl:chararray,
>> >>> plugin:chararray,
>> >>> tab:chararray,
>> >>> feature:chararray);
>> >>> dump A;
>> >>>
>> >>
>> >>
>> >> This produced 143710 records (so no problem there);
>> >>
>> >>
>> >>> and
>> >>>
>> >>> cut -f8 *week.46*clickLog.2009* | wc -l
>> >>>
>> >>
>> >>
>> >> This produced...
>> >> 175572
>> >>
>> >> Clearly, something is wrong...
>> >>
>> >>
>> >>> Thanks,
>> >>> Ashutosh
>> >>>
>> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <zaki.raha...@gmail.com> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> I have the following mini-script running as part of a larger set of
>> >>>> scripts/workflow... however it seems like pig is dropping records as when I
>> >>>> tried running the same thing as a simple grep | wc -l I get a completely
>> >>>> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>> >>>>
>> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> >>>> (timestamp:chararray,
>> >>>> ip:chararray,
>> >>>> userid:chararray,
>> >>>> dist:chararray,
>> >>>> clickid:chararray,
>> >>>> usra:chararray,
>> >>>> campaign:chararray,
>> >>>> clickurl:chararray,
>> >>>> plugin:chararray,
>> >>>> tab:chararray,
>> >>>> feature:chararray);
>> >>>>
>> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>> >>>>
>> >>>> dump B produces the following output:
>> >>>> 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> >>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>> >>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>> >>>> 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>> >>>>
>> >>>>
>> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> >>>> http://www.amazon | wc -l
>> >>>>
>> >>>> Both sets of inputs are the same files... and I'm not sure where the
>> >>>> discrepancy is coming from. Any help would be greatly appreciated.
>> >>>>
>> >>>> --
>> >>>> Zaki Rahaman
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Zaki Rahaman
>> >>
>>
>>
>
>
> --
> Zaki Rahaman
>

