Sam,
Can you post your changes to a Jira?
-D

On Fri, Nov 20, 2009 at 1:28 PM, Sam Rash <s...@ning.com> wrote:
> Hi,
>
> This reminds me of something else, though, that I took the latest patch for
> PIG-911 (sequence file reader) and found it skipped records
>
> https://issues.apache.org/jira/browse/PIG-911
>
> What I found is that the condition in getNext() would miss records:
>
> if (reader != null
>     && (reader.getPosition() < end || !reader.syncSeen())
>     && reader.next(key, value)) {
>   ...
> }
>
> I had to change it to:
>
> if (reader != null
>     && reader.next(key, value)
>     && (reader.getPosition() < end || !reader.syncSeen())) {
>   ...
> }
>
> (I also ended up breaking this out into read(key) and the code below to
> support reading types other than Writable)
>
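The practical difference between the two conditions is Java's short-circuit `&&`: in the original ordering, once `getPosition() < end || !syncSeen()` evaluates false, `reader.next(key, value)` never runs, so the pending record is never consumed; with the read first, the record is consumed before the end-of-split test applies. A stripped-down illustration of just that ordering effect (toy code, not Hadoop's actual SequenceFile.Reader semantics):

```java
// Toy demonstration of the short-circuit ordering behind the PIG-911 fix.
// read() stands in for reader.next(key, value); pastEnd stands in for
// getPosition() >= end && syncSeen(). None of this is Hadoop code.
public class ShortCircuit {
    static int reads = 0;

    static boolean read() {  // pretend this consumes one record
        reads++;
        return true;
    }

    // Returns {reads after check-first ordering, reads after read-first ordering}.
    static int[] demo() {
        reads = 0;
        boolean pastEnd = true;              // the stale end-of-split test fires
        boolean keep1 = !pastEnd && read();  // original order: read() is skipped
        int checkFirst = reads;              // still 0 -- record left unconsumed
        boolean keep2 = read() && !pastEnd;  // reordered: record consumed first
        return new int[] { checkFirst, reads };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1]);  // prints "0 1"
    }
}
```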
> This only happened when the files Pig read were more than one block; i.e.,
> the records dropped were around block boundaries.
>
> Has anyone noticed this?
>
> thx,
> -sr
>
> Sam Rash
> s...@ning.com
>
>
>
> On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:
>
>> Zaki,
>> Glad to hear it wasn't Pig's fault!
>> Can you post a description of what was going on with S3, or at least
>> how you fixed it?
>>
>> -D
>>
>> On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <zaki.raha...@gmail.com>
>> wrote:
>> > Okay, fixed a problem with corrupted file transfers from S3... now wc -l
>> > produces the same 143710 records... so yeah, it's not a Pig problem...
>> > and now I am getting the correct result from both methods... not sure
>> > what went wrong... thanks for the help, guys.
>> >
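For anyone hitting the same thing: one way to catch a corrupted S3 download before blaming the query layer is to compare checksums of the local and remote copies (for a non-multipart upload, S3 exposes the object's MD5 as its ETag). A minimal stand-alone checksum helper, as a sketch (hypothetical, not from this thread):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    // Hex-encoded MD5 of a file's contents; compare against the S3 ETag of a
    // non-multipart upload to detect a corrupted transfer.
    static String md5(File f) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(f)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("md5check", ".txt");
        f.deleteOnExit();
        Files.write(f.toPath(), "abc".getBytes("UTF-8"));
        System.out.println(md5(f));  // MD5("abc") = 900150983cd24fb0d6963f7d28e17f72
    }
}
```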
>> > On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <te...@yahoo-inc.com>
>> > wrote:
>> >
>> >> Another thing to verify is that clickurl's position in the schema is
>> >> correct.
>> >> -Thejas
>> >>
>> >>
>> >>
>> >> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hmm... are you sure that your records are separated by \n (newline)
>> >> > and fields by \t (tab)? If so, would it be possible for you to upload
>> >> > your dataset (possibly a smaller one) somewhere so that someone can
>> >> > take a look at it?
>> >> >
>> >> > Ashutosh
>> >> >
>> >> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <zaki.raha...@gmail.com>
>> >> wrote:
>> >> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> >> >> ashutosh.chau...@gmail.com> wrote:
>> >> >>
>> >> >>> Hi Zaki,
>> >> >>>
>> >> >>> Just to narrow down the problem, can you do:
>> >> >>>
>> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> >> >>> dump A;
>> >> >>>
>> >> >>
>> >> >> This produced 143710 records;
>> >> >>
>> >> >>
>> >> >>> and
>> >> >>>
>> >> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> >> >>> timestamp:chararray,
>> >> >>> ip:chararray,
>> >> >>> userid:chararray,
>> >> >>> dist:chararray,
>> >> >>> clickid:chararray,
>> >> >>> usra:chararray,
>> >> >>> campaign:chararray,
>> >> >>> clickurl:chararray,
>> >> >>> plugin:chararray,
>> >> >>> tab:chararray,
>> >> >>> feature:chararray);
>> >> >>> dump A;
>> >> >>>
>> >> >>
>> >> >>
>> >> >> This produced 143710 records (so no problem there);
>> >> >>
>> >> >>
>> >> >>> and
>> >> >>>
>> >> >>> cut -f8 *week.46*clickLog.2009* | wc -l
>> >> >>>
>> >> >>
>> >> >>
>> >> >> This produced...
>> >> >> 175572
>> >> >>
>> >> >> Clearly, something is wrong...
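One thing worth noting about the comparison itself: `cut -f8 | wc -l` counts every physical input line (`cut` emits a line even when there are fewer than 8 fields, and passes lines with no tab through whole), whereas Pig counts parsed records, so malformed or corrupted lines inflate the former. A small sanity check along those lines (hypothetical helper with invented sample rows, not from the thread):

```java
import java.util.Arrays;
import java.util.List;

public class RowCheck {
    // For each line, count it once (roughly what `wc -l` sees) and,
    // separately, only if it has at least 8 tab-separated fields, i.e. rows
    // where the 8th field (clickurl in the schema above) actually exists.
    static int[] counts(List<String> lines) {
        int total = 0;
        int with8Fields = 0;
        for (String line : lines) {
            total++;
            if (line.split("\t", -1).length >= 8) {
                with8Fields++;
            }
        }
        return new int[] { total, with8Fields };
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "t1\tip\tuid\tdist\tcid\tusra\tcamp\thttp://example.com\tplug\ttab\tfeat",
            "corrupted row with no tabs");
        int[] c = counts(sample);
        System.out.println(c[0] + " " + c[1]);  // prints "2 1"
    }
}
```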
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>> Ashutosh
>> >> >>>
>> >> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman
>> >> >>> <zaki.raha...@gmail.com>
>> >> >>> wrote:
>> >> >>>> Hi All,
>> >> >>>>
>> >> >>>> I have the following mini-script running as part of a larger set of
>> >> >>>> scripts/workflow... however, it seems like Pig is dropping records,
>> >> >>>> as when I tried running the same thing as a simple grep | wc -l I
>> >> >>>> got a completely different result (2500 with Pig vs. 3300). The Pig
>> >> >>>> script is as follows:
>> >> >>>>
>> >> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> >> >>>> (timestamp:chararray,
>> >> >>>> ip:chararray,
>> >> >>>> userid:chararray,
>> >> >>>> dist:chararray,
>> >> >>>> clickid:chararray,
>> >> >>>> usra:chararray,
>> >> >>>> campaign:chararray,
>> >> >>>> clickurl:chararray,
>> >> >>>> plugin:chararray,
>> >> >>>> tab:chararray,
>> >> >>>> feature:chararray);
>> >> >>>>
>> >> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>> >> >>>>
>> >> >>>> dump B produces the following output:
>> >> >>>> 2009-11-19 18:50:46,013 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>> >> >>>> 2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>> >> >>>>
>> >> >>>>
>> >> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> >> >>>> http://www.amazon | wc -l
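One semantic difference to keep in mind when comparing the two counts: Pig's `matches` does a whole-string regular-expression match (the same semantics as Java's `String.matches`), so the leading and trailing `.*` are required and the unescaped dots in `www.amazon` match any character, whereas `fgrep` does a fixed-string substring search. A quick illustration in plain Java (invented sample URL):

```java
public class MatchesDemo {
    public static void main(String[] args) {
        String url = "http://www.amazon.com/gp/product";

        // Whole-string regex match, as Pig's `matches` operator behaves:
        System.out.println(url.matches("http://www.amazon"));      // false: pattern must cover the entire string
        System.out.println(url.matches(".*http://www.amazon.*"));  // true: the wildcards cover the rest

        // Fixed-string substring search, as fgrep behaves:
        System.out.println(url.contains("http://www.amazon"));     // true
    }
}
```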
>> >> >>>>
>> >> >>>> Both sets of inputs are the same files... and I'm not sure where the
>> >> >>>> discrepancy is coming from. Any help would be greatly appreciated.
>> >> >>>>
>> >> >>>> --
>> >> >>>> Zaki Rahaman
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Zaki Rahaman
>> >> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Zaki Rahaman
>> >
>>
>
>
