Are .bz2 extensions supported in Hadoop 18.3
Hi All,

Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3? I tried, but the output was not what I expected: it differed from what I got when the same data was in uncompressed format.

Thanks,
Usman
Re: Are .bz2 extensions supported in Hadoop 18.3
I believe the Cloudera 18.3 supports bzip2.

On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed usm...@opera.com wrote:
> Hi All,
> Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3? I tried but interestingly the output was not what i expected versus what i got when my data was in uncompressed format.
> Thanks,
> Usman

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
RE: Are .bz2 extensions supported in Hadoop 18.3
Hi Usman,

I'm running 0.18.3 from hadoop.apache.org, and have no issues with bz2 files. My experiments with these files have been through Pig.

Hope this is useful to you.

Best regards,
Danny Gross

-----Original Message-----
From: Usman Waheed [mailto:usm...@opera.com]
Sent: Wednesday, June 24, 2009 10:09 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3

The version (18.3) I am running in my cluster is the tarball I got from hadoop.apache.org. So you are suggesting that I use the Cloudera 18.3, which supports bzip2, correct?

Thanks,
Usman
Re: Are .bz2 extensions supported in Hadoop 18.3
Hi Danny,

Hmmm, that makes me wonder whether I might be doing something wrong here. I imported just one .bz2 file into HDFS and then launched a map/reduce job by executing the following command:

    /home/hadoop/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar \
        -input /user/hadoop/logs/2009/06/22/ \
        -output /user/hadoop/out1 \
        -mapper map.pl -file map.pl \
        -reducer reduce.pl -file reduce.pl \
        -jobconf mapred.reduce.tasks=1

The .bz2 file was in the /user/hadoop/logs/2009/06/22 directory, but the final output part-0 in /user/hadoop/out1 was meaningless. I was expecting key,value pairs, but all I got was a count integer, for example: 31,006. No errors were generated at all.

When I ran the same command with the uncompressed file, my output was fine and gave me the correct key,value pairs. No errors were generated.

Noted below are my map.pl and reduce.pl.

Thanks for your help,
Usman

map.pl:

    #!/usr/bin/perl -w
    #
    #
    # Emit "<cookie>\t1" for every log line that starts with a digit.
    while (<STDIN>) {
        chomp;
        next if ( ! /^\d+/ );
        my @fields = split(/;/);
        my $cookie = $fields[11];
        print "$cookie\t1\n";
    }

reduce.pl:

    #!/usr/bin/perl -w
    #
    #
    # Sum the counts per key, then print "<key>\t<total>".
    while (<STDIN>) {
        chomp;
        ($key,$value) = split(/\t/);
        $count{$key} += $value;
    }
    foreach $k (keys %count) {
        $c = $count{$k};
        print "$k\t$c\n";
    }
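One way to check the map and reduce logic independently of Hadoop is to run the same steps locally on a few lines. The sketch below is a Python rendering of the two Perl scripts above, assuming semicolon-separated log lines with the cookie at index 11 (as map.pl reads it). The sample lines and the function names `map_lines` and `reduce_pairs` are invented for illustration and are not part of the original job.

```python
import re
from collections import defaultdict


def map_lines(lines):
    """Mimic map.pl: emit (cookie, 1) for each line that starts with a digit."""
    for line in lines:
        line = line.rstrip("\n")
        if not re.match(r"\d", line):
            continue
        fields = line.split(";")
        # map.pl reads $fields[11]; a line with fewer fields gives an empty key here.
        cookie = fields[11] if len(fields) > 11 else ""
        yield cookie, 1


def reduce_pairs(pairs):
    """Mimic reduce.pl: sum the counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical sample: 12 semicolon-separated fields, cookie in the last one.
sample = [
    "1;a;b;c;d;e;f;g;h;i;j;cookieA\n",
    "2;a;b;c;d;e;f;g;h;i;j;cookieB\n",
    "3;a;b;c;d;e;f;g;h;i;j;cookieA\n",
    "# comment line, skipped by the digit check\n",
]
result = reduce_pairs(map_lines(sample))
for key, count in sorted(result.items()):
    print(f"{key}\t{count}")
```

If the logic behaves as expected on plain lines like these, the difference seen on the cluster is more likely to come from how the input reaches the mapper than from the scripts themselves.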
Re: Are .bz2 extensions supported in Hadoop 18.3
On Wed, 24 Jun 2009 12:45:59 +0200, Usman Waheed wrote:
> Hi All,
> Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3? I tried but interestingly the output was not what i expected versus what i got when my data was in uncompressed format.
> Thanks,
> Usman

Not AFAIK, but we have added bzip2 support as of 0.19 (see JIRA HADOOP-3646), and have splitting support working (see JIRA HADOOP-4012) as a patch. Getting HADOOP-4012 committed has been painful, but it seems close.

-John Heidemann
RE: Are .bz2 extensions supported in Hadoop 18.3
Hi Usman,

I believe your issue is specifically in the contrib/hadoop-streaming.jar. I ran a test Python job with hadoop-streaming.jar on a bz2 file with no errors. However, the output was junk. Pig has no issue with bz2 files.

According to http://hadoop.apache.org/core/docs/r0.15.2/streaming.html, streaming.jar reads the file line-by-line. It seems that you would have to specify another plugin via the -inputformat option for your app in order for hadoop-streaming.jar to properly handle the compressed file.

I'm not much help beyond that. Perhaps you might try Pig (this is such a cool platform!).

Hope it helps.

Best regards,
Danny
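Danny's observation that streaming hands the mapper its input line by line suggests why compressed input could produce junk without raising any error. If the bytes reach the mapper without being decompressed first, the "lines" it sees are arbitrary pieces of the bzip2 stream. The Python illustration below uses only the standard `bz2` module. The sample records are made up, and the sketch does not show how any particular Hadoop release decides whether to decompress its input.

```python
import bz2

# Hypothetical log records; real logs would be much larger.
records = [f"{i};field;cookie{i % 3}\n".encode() for i in range(1000)]
plain = b"".join(records)
compressed = bz2.compress(plain)

# Splitting the compressed bytes on newlines gives fragments of the
# bzip2 stream, not the original records.
raw_lines = compressed.split(b"\n")

# Decompressing first recovers the original records exactly.
decoded_lines = bz2.decompress(compressed).splitlines(keepends=True)

print("original records:      ", len(records))
print("lines after decompress:", len(decoded_lines))
print("'lines' in raw bytes:  ", len(raw_lines))
print("raw stream starts with:", compressed[:3])
```

A mapper that filters and splits such raw fragments can still emit something, which would be consistent with a job that finishes cleanly but produces meaningless output.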
Re: Are .bz2 extensions supported in Hadoop 18.3
Cool, thanks. Will check it out.

-Usman
Re: Are .bz2 extensions supported in Hadoop 18.3
This is correct - thanks for the note Jason.

You can see the current patch list for Cloudera's Distribution (based on 18.3) at: http://www.cloudera.com/hadoop-manifest

In addition to Bzip2, we have patched in: DBInputFormat, the fair scheduler, job level task limiting, soft fd leak fix, a fix for HDFS under-replication, shuffle improvements, EC2/S3 improvements, and Sqoop - database import for Hadoop.

You can download RPMs and Ubuntu packages as well as preconfigured EC2 images from: http://www.cloudera.com/hadoop

Cheers,
Christophe

On Wed, Jun 24, 2009 at 6:47 AM, jason hadoop jason.had...@gmail.com wrote:
> I believe the cloudera 18.3 supports bzip2

--
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera
Re: Are .bz2 extensions supported in Hadoop 18.3
Very cool, we are using Debian and I checked Cloudera's website. You have packages for the Debian platform. Will check it out and install on a test cluster.

Thanks much,
Usman