Another thing to verify is that clickurl's position in the schema is
correct.
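One quick way to sanity-check that locally (a sketch: `sample.log` is a hypothetical stand-in for one of the log files, seeded here with a single row carrying the field names in the LOAD order):

```shell
# Hypothetical one-record file in the schema order from the LOAD statement.
printf 'timestamp\tip\tuserid\tdist\tclickid\tusra\tcampaign\tclickurl\tplugin\ttab\tfeature\n' > sample.log
# Number each field of the first record; clickurl should come out as 8,
# matching both cut -f8 and the 8th name in the AS (...) clause.
head -n 1 sample.log | tr '\t' '\n' | cat -n
```

Running this against a real record (instead of the seeded header) shows at a glance whether field 8 actually holds a URL.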
-Thejas



On 11/19/09 11:43 AM, "Ashutosh Chauhan" <[email protected]> wrote:

> Hmm... Are you sure that your records are separated by \n (newline)
> and fields by \t (tab)? If so, would it be possible for you to upload your
> dataset (possibly a smaller one) somewhere so that someone can take a look
> at it?
> 
> Ashutosh
> 
> On Thu, Nov 19, 2009 at 14:35, zaki rahaman <[email protected]> wrote:
>> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan <
>> [email protected]> wrote:
>> 
>>> Hi Zaki,
>>> 
>>> Just to narrow down the problem, can you do:
>>> 
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
>>> dump A;
>>> 
>> 
>> This produced 143710 records.
>> 
>> 
>>> and
>>> 
>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
>>> timestamp:chararray,
>>> ip:chararray,
>>> userid:chararray,
>>> dist:chararray,
>>> clickid:chararray,
>>> usra:chararray,
>>> campaign:chararray,
>>> clickurl:chararray,
>>> plugin:chararray,
>>> tab:chararray,
>>> feature:chararray);
>>> dump A;
>>> 
>> 
>> 
>> This produced 143710 records (so no problem there).
>> 
>> 
>>> and
>>> 
>>> cut -f8 *week.46*clickLog.2009* | wc -l
>>> 
>> 
>> 
>> This produced...
>> 175572
>> 
>> Clearly, something is wrong...
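[If the raw files contain rows with a stray tab or a missing field, line-oriented tools and Pig's schema can disagree about what a record holds. A quick diagnostic, as a sketch (`sample.log` is hypothetical data seeded with one malformed row):

```shell
# Two well-formed 11-field rows plus one short row (hypothetical data).
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\n'  > sample.log
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\n' >> sample.log
printf 'a\tb\tc\n'                          >> sample.log
# Count rows whose field count differs from the 11 the schema expects;
# any nonzero result means field-8 counts and record counts can diverge.
awk -F'\t' 'NF != 11 { bad++ } END { print bad + 0 }' sample.log
```

Running the same awk check over the real `*week.46*clickLog.2009*` files would show whether malformed rows account for the 143710 vs. 175572 gap.]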
>> 
>> 
> >>> Thanks,
>>> Ashutosh
>>> 
>>> On Thu, Nov 19, 2009 at 14:03, zaki rahaman <[email protected]>
>>> wrote:
>>>> Hi All,
>>>> 
>>>> I have the following mini-script running as part of a larger set of
> >>>> scripts/workflow... However, it seems like Pig is dropping records: when I
> >>>> tried running the same thing as a simple grep | wc -l, I got a completely
> >>>> different result (2500 with Pig vs. 3300). The Pig script is as follows:
>>>> 
>>>> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
>>>> (timestamp:chararray,
>>>> ip:chararray,
>>>> userid:chararray,
>>>> dist:chararray,
>>>> clickid:chararray,
>>>> usra:chararray,
>>>> campaign:chararray,
>>>> clickurl:chararray,
>>>> plugin:chararray,
>>>> tab:chararray,
>>>> feature:chararray);
>>>> 
> >>>> B = FILTER A BY clickurl matches '.*http://www.amazon.*';
>>>> 
>>>> dump B produces the following output:
>>>> 2009-11-19 18:50:46,013 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Successfully stored result in: "s3://kikin-pig-test/amazonoutput2"
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Records written : 2502
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Bytes written : 0
>>>> 2009-11-19 18:50:46,058 [main] INFO
>>>> 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLaunch
>>> er
>>>> - Success!
>>>> 
>>>> 
>>>> The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
>>>> http://www.amazon | wc -l
>>>> 
> >>>> Both sets of inputs are the same files... and I'm not sure where the
> >>>> discrepancy is coming from. Any help would be greatly appreciated.
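[On well-formed input the bash pipeline and the Pig FILTER should agree, so a mismatch points at the inputs rather than the FILTER itself. A side-by-side check on a tiny hypothetical sample (`sample.log` stands in for the real files; one amazon URL, one other):

```shell
# Hypothetical two-record sample in the 11-field schema; clickurl is field 8.
printf 'ts\tip\tuid\tdist\tcid\tusra\tcamp\thttp://www.amazon.com/a\tplug\ttab\tfeat\n'  > sample.log
printf 'ts\tip\tuid\tdist\tcid\tusra\tcamp\thttp://www.other.com/b\tplug\ttab\tfeat\n'  >> sample.log
# The bash side of the comparison: substring match on field 8.
cut -f8 sample.log | fgrep -c 'http://www.amazon'
```

If the counts still differ on the real data, comparing the file list the shell glob expands to against what Hadoop's glob actually matched is another thing worth checking.]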
>>>> 
>>>> --
>>>> Zaki Rahaman
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Zaki Rahaman
>> 
