Hi Usman,

I believe your issue is specifically in the contrib
hadoop-streaming.jar.  I ran a test Python job with
hadoop-streaming.jar on a bz2 file and got no errors, but the output
was junk.

Pig has no issue with bz2 files.

According to http://hadoop.apache.org/core/docs/r0.15.2/streaming.html,
streaming reads the input file line by line.  It seems you would have
to specify a different input format via the -inputformat option for
hadoop-streaming.jar to handle the compressed file properly. I'm not
much help beyond that. Perhaps you might try Pig (it's such a cool
platform!)
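For what it's worth, streaming's default TextInputFormat should decompress input transparently when a matching codec is registered, so another route (just a sketch, untested by me) would be a build that actually ships org.apache.hadoop.io.compress.BZip2Codec (stock 0.19+, or apparently Cloudera's 0.18.3), with that codec listed in io.compression.codecs:

```shell
# Sketch only: paths and input/output dirs are the ones from your
# command below; this assumes the build's classpath includes BZip2Codec.
/home/hadoop/hadoop/bin/hadoop jar \
  /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar \
  -jobconf io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec \
  -input /user/hadoop/logs/2009/06/22/ -output /user/hadoop/out1 \
  -mapper map.pl -file map.pl -reducer reduce.pl -file reduce.pl
```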

Hope this helps.

Best regards,

Danny

-----Original Message-----
From: Usman Waheed [mailto:usm...@opera.com] 
Sent: Wednesday, June 24, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3

Hi Danny,

Hmmm, makes me wonder whether I might be doing something wrong here. I 
imported just one .bz2 file into HDFS and then launched a map/reduce 
job by executing the following command:

/home/hadoop/hadoop/bin/hadoop jar \
  /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input /user/hadoop/logs/2009/06/22/ -output /user/hadoop/out1 \
  -mapper map.pl -file map.pl -reducer reduce.pl -file reduce.pl \
  -jobconf mapred.reduce.tasks=1

The .bz2 file was in the /user/hadoop/logs/2009/06/22 directory, but the 
final output part-00000 in /user/hadoop/out1 was meaningless. I was 
expecting key,value pairs, but all I got was a single count integer, for 
example 31,006. No errors were generated at all.

When I ran the same command on the uncompressed file, my output was 
fine, giving me the correct key,value pairs. No errors were generated.

Noted below are my map.pl and reduce.pl.

Thanks for your help,
Usman

_*map.pl*_

#!/usr/bin/perl -w
#
# map.pl: emit "<cookie>\t1" for each log record on stdin.

while (<STDIN>) {

    chomp;
    next if ( ! /^\d+/ );      # skip lines that don't start with a digit
    my @fields = split(/;/);   # records are semicolon-delimited
    my $cookie = $fields[11];  # field 12 holds the cookie
    print "$cookie\t1\n";

}


_*reduce.pl*_

#!/usr/bin/perl -w
#
# reduce.pl: sum the per-cookie counts emitted by map.pl.

while (<STDIN>) {

    chomp;
    ($key, $value) = split(/\t/);
    $count{$key} += $value;    # accumulate counts per cookie

}

foreach $k (keys %count) {
    $c = $count{$k};
    print "$k\t$c\n";
}



> Hi Usman,
>
> I'm running 0.18.3 from hadoop.apache.org, and have no issues with bz2
> files.   My experiments with these files have been through Pig.  Hope
> this is useful to you.
>
> Best regards,
>
> Danny Gross
>
> -----Original Message-----
> From: Usman Waheed [mailto:usm...@opera.com] 
> Sent: Wednesday, June 24, 2009 10:09 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Are .bz2 extensions supported in Hadoop 18.3
>
> The version (18.3) I am running in my cluster is the tarball I got
> from hadoop.apache.org.
> So you are suggesting to use the Cloudera 18.3, which supports bzip2,
> correct?
>
> Thanks,
> Usman
>
>   
>> I believe the cloudera 18.3 supports bzip2
>>
>> On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed <usm...@opera.com> wrote:
>>> Hi All,
>>>
>>> Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3?
>>> I tried, but interestingly the output was not what I expected versus
>>> what I got when my data was in uncompressed format.
>>>
>>> Thanks,
>>> Usman
