Hmm... Are you sure that your records are separated by \n (newline)
and fields by \t (tab)? If so, would it be possible for you to upload
your dataset (possibly a smaller sample) somewhere so that someone can
take a look at it?
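
In the meantime, here is a rough way to check the delimiters locally. It
is only a sketch: it assumes the log files are plain, uncompressed text
on local disk, and field_histogram is just an illustrative helper name,
not anything that ships with Pig. It prints how many lines have each
number of tab-separated fields, so rows that do not have the 11 fields
your schema declares (including empty lines, which show up as 0) stand
out.

```shell
# Hypothetical helper: for each distinct number of tab-separated fields,
# print "<field count> <number of lines>", sorted by field count.
field_histogram() {
    awk -F'\t' '{ counts[NF]++ }
                END { for (n in counts) print n, counts[n] }' "$@" | sort -n
}
```

You could run it as field_histogram *week.46*clickLog.2009* in the
directory where you ran cut. If every line reports 11 fields, the
delimiters are probably fine and the difference lies elsewhere.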

Ashutosh

On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
> [email protected]> wrote:
>
>> Hi Zaki,
>>
>> Just to narrow down the problem, can you do:
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>> dump A;
>>
>
> This produced 143710 records;
>
>
>> and
>>
>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>> timestamp:chararray,
>> ip:chararray,
>> userid:chararray,
>> dist:chararray,
>> clickid:chararray,
>> usra:chararray,
>> campaign:chararray,
>> clickurl:chararray,
>> plugin:chararray,
>> tab:chararray,
>> feature:chararray);
>> dump A;
>>
>
>
> This produced 143710 records (so no problem there);
>
>
>> and
>>
>> cut -f8 *week.46*clickLog.2009* | wc -l
>>
>
>
> This produced...
> 175572
>
> Clearly, something is wrong...
>
>
>> Thanks,
>> Ashutosh
>>
>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]>
>> wrote:
>> > Hi All,
>> >
>> > I have the following mini-script running as part of a larger set of
>> > scripts/workflow. However, it seems like Pig is dropping records: when I
>> > ran the same thing as a simple grep | wc -l, I got a completely
>> > different result (2500 with Pig vs. 3300). The Pig script is as follows:
>> >
>> > A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>> > (timestamp:chararray,
>> > ip:chararray,
>> > userid:chararray,
>> > dist:chararray,
>> > clickid:chararray,
>> > usra:chararray,
>> > campaign:chararray,
>> > clickurl:chararray,
>> > plugin:chararray,
>> > tab:chararray,
>> > feature:chararray);
>> >
>> > B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>> >
>> > dump B produces the following output:
>> > 2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
>> > 2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>> >
>> >
>> > The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>> > http://www.amazon | wc -l
>> >
>> > Both sets of inputs are the same files, and I'm not sure where the
>> > discrepancy is coming from. Any help would be greatly appreciated.
>> >
>> > --
>> > Zaki Rahaman
>> >
>>
>
>
>
> --
> Zaki Rahaman
>
