Cool, thanks, will check it out.
-Usman
Hi Usman,
I believe your issue is specifically in contrib/hadoop-streaming.jar. I
ran a test Python job with hadoop-streaming.jar on a bz2 file with no
errors; however, the output was junk.
Pig has no issue with bz2 files.
According to http://hadoop.apache.org/core/docs/r0.15.2/streaming.html,
streaming.jar reads the file line by line. It seems you would have to
specify another input format via the -inputformat option in order for
hadoop-streaming.jar to handle the compressed file properly. I'm not
much help beyond that. Perhaps you might try Pig (it's such a cool
platform!)
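One quick way to see what streaming is actually feeding your mapper: read the compressed file the way a line-oriented reader would, versus decompressing it first. A small sketch (the file name and contents here are made up for illustration; the assumption is that the stock 0.18 codecs don't decode bzip2 on input, so the mapper sees raw compressed bytes):

```shell
# Create a tiny sample log and compress it (bzip2 -k keeps the original).
printf '1;foo\n2;bar\n' > sample.log
bzip2 -k sample.log

# Reading the .bz2 directly yields compressed bytes, not log lines --
# the first three bytes are the bzip2 magic "BZh", not data.
head -c 3 sample.log.bz2
echo

# Decompressing first recovers the original two lines.
bzcat sample.log.bz2
```

So one workaround on a stock 0.18 install is simply to decompress before loading into HDFS (bunzip2, then hadoop fs -put), or to run a build whose codecs include bzip2, as mentioned elsewhere in this thread.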
Hope it helps.
Best regards,
Danny
-----Original Message-----
From: Usman Waheed [mailto:usm...@opera.com]
Sent: Wednesday, June 24, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3
Hi Danny,
Hmmm, makes me wonder whether I might be doing something wrong here. I
imported just one .bz2 file into HDFS and then launched a map/reduce
task by executing the following command:
/home/hadoop/hadoop/bin/hadoop jar \
  /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input /user/hadoop/logs/2009/06/22/ \
  -output /user/hadoop/out1 \
  -mapper map.pl -file map.pl \
  -reducer reduce.pl -file reduce.pl \
  -jobconf mapred.reduce.tasks=1
The .bz2 file was in the /user/hadoop/logs/2009/06/22 directory, but
the final output part-00000 in /user/hadoop/out1 was meaningless. I was
expecting key/value pairs, but all I got was a single count, for
example 31,006. No errors were generated at all.
When I ran the same command above with the uncompressed file, my output
was fine, giving me the correct key/value pairs. No errors were
generated.
Noted below are my map.pl and reduce.pl.
Thanks for your help,
Usman
map.pl:
#!/usr/bin/perl -w
use strict;

# Emit "<cookie>\t1" for every log line that starts with a digit.
while (<STDIN>) {
    chomp;
    next if ( ! /^\d+/ );       # skip lines that don't start with a digit
    my @fields = split(/;/);    # semicolon-delimited log line
    my $cookie = $fields[11];   # 12th field holds the cookie
    print "$cookie\t1\n";
}
reduce.pl:
#!/usr/bin/perl -w
use strict;

# Sum the per-cookie counts emitted by map.pl.
my %count;
while (<STDIN>) {
    chomp;
    my ($key, $value) = split(/\t/);
    $count{$key} += $value;
}
foreach my $k (keys %count) {
    print "$k\t$count{$k}\n";
}
Hi Usman,
I'm running 0.18.3 from hadoop.apache.org, and have no issues with bz2
files. My experiments with these files have been through Pig. Hope
this is useful to you.
Best regards,
Danny Gross
-----Original Message-----
From: Usman Waheed [mailto:usm...@opera.com]
Sent: Wednesday, June 24, 2009 10:09 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3
The version (18.3) I am running in my cluster is the tarball I got from
hadoop.apache.org.
So you are suggesting I use the Cloudera 18.3, which supports bzip2,
correct?
Thanks,
Usman
I believe the Cloudera 18.3 supports bzip2.
On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed <usm...@opera.com>
wrote:
Hi All,
Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3?
I tried, but interestingly the output was not what I expected versus
what I got when my data was in uncompressed format.
Thanks,
Usman