Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Usman Waheed

Hi All,

Can I map/reduce logs that have the .bz2 extension in Hadoop 0.18.3?
I tried, but the output was not what I expected: it differed from what
I got when my data was in uncompressed format.


Thanks,
Usman


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread jason hadoop
I believe the Cloudera distribution based on 0.18.3 supports bzip2.


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


RE: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Gross, Danny
Hi Usman,

I'm running 0.18.3 from hadoop.apache.org, and have no issues with bz2
files.   My experiments with these files have been through Pig.  Hope
this is useful to you.

Best regards,

Danny Gross

-----Original Message-----
From: Usman Waheed [mailto:usm...@opera.com]
Sent: Wednesday, June 24, 2009 10:09 AM
To: core-user@hadoop.apache.org
Subject: Re: Are .bz2 extensions supported in Hadoop 18.3

The version (0.18.3) I am running in my cluster is the tarball I got from
hadoop.apache.org.
So you are suggesting I use the Cloudera 0.18.3 distribution, which
supports bzip2, correct?

Thanks,
Usman


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Usman Waheed

Hi Danny,

Hmm, that makes me wonder whether I am doing something wrong here. I
imported just one .bz2 file into HDFS and then launched a map/reduce
job with the following command:


/home/hadoop/hadoop/bin/hadoop jar \
  /home/hadoop/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input /user/hadoop/logs/2009/06/22/ -output /user/hadoop/out1 \
  -mapper map.pl -file map.pl -reducer reduce.pl -file reduce.pl \
  -jobconf mapred.reduce.tasks=1


The .bz2 file was in the /user/hadoop/logs/2009/06/22 directory, but the
final output part-0 in /user/hadoop/out1 was meaningless. I was
expecting key,value pairs, but all I got was a single integer count, for
example 31,006. No errors were generated at all.


When I ran the same command on the uncompressed file, my output was
fine, giving me the correct key,value pairs. No errors were generated.


Noted below are my map.pl and reduce.pl.

Thanks for your help,
Usman

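map.pl

#!/usr/bin/perl -w
# Streaming mapper: emit "<cookie>\t1" for every log record.

while (<STDIN>) {
   chomp;
   next if ( ! /^\d+/ );        # skip lines that do not start with a digit
   my @fields = split(/;/);
   my $cookie = $fields[11];    # 12th semicolon-separated field
   print "$cookie\t1\n";
}


reduce.pl

#!/usr/bin/perl -w
# Streaming reducer: sum the counts for each key.

my %count;

while (<STDIN>) {
   chomp;
   my ($key, $value) = split(/\t/);
   $count{$key} += $value;
}

foreach my $k (keys %count) {
   my $c = $count{$k};
   print "$k\t$c\n";
}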
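
To check the expected output without a cluster, the logic of the two scripts can be sketched locally. This is a minimal Python sketch, not the poster's code: the sample records, the function names map_line and reduce_pairs, and the choice to skip lines with fewer than 12 fields are assumptions made for illustration.

```python
import re
from collections import Counter

def map_line(line):
    """Mirror map.pl: skip lines not starting with a digit, emit (field 11, 1)."""
    line = line.rstrip("\n")
    if not re.match(r"\d", line):
        return None
    fields = line.split(";")
    if len(fields) <= 11:
        return None  # assumption: short lines are skipped instead of emitting an empty key
    return fields[11], 1

def reduce_pairs(pairs):
    """Mirror reduce.pl: sum the values for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical records: 12 semicolon-separated fields, cookie in the last one.
sample = [
    "# comment line, skipped by the mapper",
    ";".join(["1001"] + ["x"] * 10 + ["cookieA"]),
    ";".join(["1002"] + ["x"] * 10 + ["cookieB"]),
    ";".join(["1003"] + ["x"] * 10 + ["cookieA"]),
]

pairs = [p for p in map(map_line, sample) if p is not None]
totals = reduce_pairs(sorted(pairs))  # sorted() stands in for the shuffle/sort phase
for key in sorted(totals):
    print(f"{key}\t{totals[key]}")
```

Output of this shape (one key and one count per line) is what the uncompressed run should produce; a single bare integer suggests the mapper never saw well-formed records.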





Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread John Heidemann
On Wed, 24 Jun 2009 12:45:59 +0200, Usman Waheed wrote:
> Hi All,
>
> Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3?
> I tried but interestingly the output was not what i expected versus
> what i got when my data was in uncompressed format.
>
> Thanks,
> Usman


Not AFAIK, but we have added bzip2 support as of 0.19
(see JIRA HADOOP-3646),
and have splitting support working (see JIRA HADOOP-4012) as a patch.
Getting HADOOP-4012 committed has been painful,
but it seems close.

   -John Heidemann
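
Since uncompressed input is reported to work on 0.18.3 elsewhere in this thread, one possible workaround on releases without bzip2 support is to decompress the file before loading it into HDFS. A minimal Python sketch, assuming the decompressed file fits on local disk; the function name decompress_bz2 and the file names are hypothetical.

```python
import bz2
import os
import shutil
import tempfile

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path in fixed-size chunks."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Demo on a throwaway file with hypothetical log content.
original = b"1001;a;b\n1002;c;d\n"
with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "access.log.bz2")
    unpacked = os.path.join(tmp, "access.log")
    with open(packed, "wb") as f:
        f.write(bz2.compress(original))
    decompress_bz2(packed, unpacked)
    with open(unpacked, "rb") as f:
        restored = f.read()

print(restored == original)
```

The resulting plain file can then be copied into HDFS in the usual way, at the cost of the extra storage.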




RE: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Gross, Danny
Hi Usman,

I believe your issue is specific to contrib/hadoop-streaming.jar. I ran
a test Python job with hadoop-streaming.jar on a bz2 file with no
errors. However, the output was junk.

Pig has no issue with bz2 files.

According to http://hadoop.apache.org/core/docs/r0.15.2/streaming.html,
streaming.jar reads the file line by line. It seems that you would have
to specify another plugin via the -inputformat option in order for
hadoop-streaming.jar to handle the compressed file properly. I'm not
much help beyond that. Perhaps you might try Pig (this is such a cool
platform!).
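
The junk output is consistent with the compressed bytes being split into lines without any decompression. A minimal Python sketch of that effect, run locally rather than on Hadoop: the sample records and the helper count_matching_lines are assumptions for illustration, and the filter mirrors the ^\d+ test in map.pl.

```python
import bz2
import re

# Hypothetical records: 12 semicolon-separated fields, first field numeric.
records = [";".join([str(1000 + i)] + ["x"] * 10 + [f"cookie{i % 3}"]) for i in range(6)]
plain = ("\n".join(records) + "\n").encode("ascii")
compressed = bz2.compress(plain)

starts_with_digit = re.compile(rb"^\d+")

def count_matching_lines(data):
    """Count newline-delimited chunks that pass the mapper's ^\\d+ filter."""
    return sum(1 for line in data.split(b"\n") if starts_with_digit.match(line))

plain_matches = count_matching_lines(plain)
raw_matches = count_matching_lines(compressed)  # what a reader that does not decompress sees
decoded_matches = count_matching_lines(bz2.decompress(compressed))

print("magic header:", compressed[:3])
print("records in plain text:", plain_matches)
print("records after decompression:", decoded_matches)
print("'records' in raw compressed bytes:", raw_matches)
```

The raw compressed stream starts with the bzip2 magic bytes and contains newline bytes only by accident, so any lines the mapper receives from it bear no relation to the original records.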

Hope it helps.

Best regards,

Danny


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Usman Waheed

Cool, thanks. Will check it out.
-Usman


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Christophe Bisciglia
This is correct - thanks for the note Jason. You can see the current
patch list for Cloudera's Distribution (based on 18.3) at:
http://www.cloudera.com/hadoop-manifest

In addition to Bzip2, we have patched in: DBInputFormat, the fair
scheduler, job level task limiting, soft fd leak fix, a fix for HDFS
under-replication, shuffle improvements, EC2/S3 improvements, and
Sqoop - database import for Hadoop.

You can download RPMs and Ubuntu packages as well as preconfigured EC2
images from: http://www.cloudera.com/hadoop

Cheers,
Christophe




-- 
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread Usman Waheed
Very cool. We are using Debian, and I checked Cloudera's website: you
have packages for the Debian platform.

Will check it out and install on a test cluster.

Thanks much,
Usman

