Zaki,
Glad to hear it wasn't Pig's fault!
Can you post a description of what was going on with S3, or at least
how you fixed it?

-D

On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman <[email protected]> wrote:
> Okay, fixed a problem with corrupted file transfers from S3... now wc -l
> produces the same 143710 records... so yeah, it's not a problem... and now I
> am getting the correct result from both methods... not sure what went
> wrong... thanks for the help though, guys.
>
> On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair <[email protected]> wrote:
>
>> Another thing to verify is that clickurl's position in the schema is
>> correct.
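>>
>> One quick way to eyeball that might be to project just that column and
>> dump a few rows, e.g. (alias names here are just for illustration,
>> assuming A is loaded with the schema from your script):
>>
>> urls = FOREACH A GENERATE clickurl;
>> few = LIMIT urls 10;
>> dump few;
>>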
>> -Thejas
>>
>>
>>
>> On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]>
>> wrote:
>>
>> > Hmm... Are you sure that your records are separated by \n (newline)
>> > and fields by \t (tab)? If so, would it be possible for you to upload
>> > your dataset (or a smaller sample) somewhere so that someone can take a
>> > look at it?
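>> >
>> > For what it's worth, PigStorage splits on tabs by default; it may be
>> > worth stating the delimiter explicitly just to rule that out:
>> >
>> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' USING PigStorage('\t');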
>> >
>> > Ashutosh
>> >
>> > On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]>
>> wrote:
>> >> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> >> [email protected]> wrote:
>> >>
>> >>> Hi Zaki,
>> >>>
>> >>> Just to narrow down the problem, can you do:
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> >>> dump A;
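>> >>>
>> >>> If the full dump is unwieldy, a plain record count would do as well,
>> >>> something like (grp and cnt are just illustrative alias names):
>> >>>
>> >>> grp = GROUP A ALL;
>> >>> cnt = FOREACH grp GENERATE COUNT(A);
>> >>> dump cnt;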
>> >>>
>> >>
>> >> This produced 143710 records.
>> >>
>> >>
>> >>> and
>> >>>
>> >>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> >>> timestamp:chararray,
>> >>> ip:chararray,
>> >>> userid:chararray,
>> >>> dist:chararray,
>> >>> clickid:chararray,
>> >>> usra:chararray,
>> >>> campaign:chararray,
>> >>> clickurl:chararray,
>> >>> plugin:chararray,
>> >>> tab:chararray,
>> >>> feature:chararray);
>> >>> dump A;
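>> >>>
>> >>> Running describe A; after that LOAD should also echo back the schema
>> >>> Pig is using, which would confirm that clickurl is the eighth field
>> >>> (i.e. what cut -f8 picks up).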
>> >>>
>> >>
>> >>
>> >> This produced 143710 records (so no problem there).
>> >>
>> >>
>> >>> and
>> >>>
>> >>> cut -f8 *week.46*clickLog.2009* | wc -l
>> >>>
>> >>
>> >>
>> >> This produced...
>> >> 175572
>> >>
>> >> Clearly, something is wrong...
>> >>
>> >>
>> >>> Thanks,
>> >>> Ashutosh
>> >>>
>> >>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]>
>> >>> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> I have the following mini-script running as part of a larger set of
>> >>>> scripts/workflow... however, it seems like Pig is dropping records:
>> >>>> when I ran the same thing as a simple grep | wc -l, I got a completely
>> >>>> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>> >>>>
>> >>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> >>>> (timestamp:chararray,
>> >>>> ip:chararray,
>> >>>> userid:chararray,
>> >>>> dist:chararray,
>> >>>> clickid:chararray,
>> >>>> usra:chararray,
>> >>>> campaign:chararray,
>> >>>> clickurl:chararray,
>> >>>> plugin:chararray,
>> >>>> tab:chararray,
>> >>>> feature:chararray);
>> >>>>
>> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>> >>>>
>> >>>> dump B produces the following output:
>> >>>> 2009-11-19 18:50:46,013 [main] INFO
>> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >>>> - Records written : 2502
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >>>> - Bytes written : 0
>> >>>> 2009-11-19 18:50:46,058 [main] INFO
>> >>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >>>> - Success!
>> >>>>
>> >>>>
>> >>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> >>>> http://www.amazon | wc -l
>> >>>>
>> >>>> Both sets of inputs are the same files... and I'm not sure where the
>> >>>> discrepancy is coming from. Any help would be greatly appreciated.
>> >>>>
>> >>>> --
>> >>>> Zaki Rahaman
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Zaki Rahaman
>> >>
>>
>>
>
>
> --
> Zaki Rahaman
>
