Re: Hadoop EC2

2008-09-02 Thread Andrew Hitchcock
Hi Ryan,

Just a heads up, if you require more than the 20 node limit, Amazon
provides a form to request a higher limit:

http://www.amazon.com/gp/html-forms-controller/ec2-request

Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan



Re: Hadoop EC2

2008-09-02 Thread tim robertson
I have been processing only 100s of GBs on EC2, not 1000s, using 20
nodes, and really only in an exploration and testing phase right now.





Reading and writing Thrift data from MapReduce

2008-09-02 Thread Juho Mäkinen
We are already using Thrift to move and store our log data and I'm
looking into how I could read the stored log data into MapReduce
processes. This article
http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html
talks about using Thrift for the I/O, but it doesn't say anything
specific.

What's the current status of Thrift with Hadoop? Is there any
documentation online or even some code in the SVN which I could look
into?

 - Juho Mäkinen


Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15GB of textual uncompressed data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take a shorter amount of time, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.

Thanks,
Ryan






Re: Hadoop EC2

2008-09-02 Thread tim robertson
Hi Ryan,

I actually blogged my experience as it was my first usage of EC2:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

My input data was not log files but actually a dump of 150 million
records from MySQL into about 13 columns of tab-delimited data, I believe.
It was a couple of months ago, but I remember thinking S3 was very slow...

I ran some simple operations like distinct values of one column based
on another (species within a cell) and also did some polygon analysis,
since asking "is this point in this polygon" does not really scale too
well in PostGIS.

Incidentally, I have most of the basics of a MapReduce-Lite which I
aim to port to the exact Hadoop API, since I am *only* working on
10's-100's of GB of data and find that it runs really fine on my
laptop and I don't need the distributed failover.  My goal for that
code is for people like me who want to know that they can scale to
terabyte processing, but don't need to take the plunge to a full Hadoop
deployment yet, and want to know that they can migrate the processing
in the future as things grow.  It runs on the normal filesystem, and
single node only (i.e. multithreaded), and performs very quickly since
it is just doing Java NIO ByteBuffers in parallel on the underlying
filesystem - on my laptop I Map+Sort+Combine about 130,000 records a
second (the simplest of simple map operations).  For these small
datasets, you might find it useful - let me know if I should spend
time finishing it (or submit help?) - it is really very simple.

Cheers

Tim






Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.

Thanks!

Ryan





Re: Reading and writing Thrift data from MapReduce

2008-09-02 Thread Stuart Sierra
On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen [EMAIL PROTECTED] wrote:
 What's the current status of Thrift with Hadoop? Is there any
 documentation online or even some code in the SVN which I could look
 into?

I think you have two choices: 1) wrap your Thrift code in a class that
implements Writable, or 2) use Thrift to serialize your data to byte
arrays and store them as BytesWritable.
-Stuart
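A minimal sketch of option 1, for illustration only: the class name and fields
below are invented, and the Thrift serialization/deserialization itself is
assumed to happen elsewhere (for example in the map function), so the wrapper
just carries raw bytes:

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/** Hypothetical wrapper that carries one Thrift-serialized record as raw bytes. */
public class ThriftRecordWritable implements Writable {
  private byte[] bytes = new byte[0];

  public void set(byte[] serialized) { bytes = serialized; }
  public byte[] get() { return bytes; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(bytes.length);   // length prefix, then the raw Thrift bytes
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    bytes = new byte[in.readInt()];
    in.readFully(bytes);          // read back exactly what write() emitted
  }
}
{code}

Option 2 is the same idea with the stock BytesWritable class doing the byte
carrying, so no custom Writable is needed.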


HDFS space utilization

2008-09-02 Thread Victor Samoylov
Hi,

I ran 3 datanodes as HDFS and saw that at the beginning (no files in HDFS)
I have only 15 GB instead of 22 GB; see the following live status of my nodes:

10 files and directories, 0 blocks = 10 total. Heap Size is 5.21 MB / 992.31 MB (0%)
Capacity: 22.64 GB   DFS Remaining: 15.42 GB   DFS Used: 72 KB   DFS Used%: 0%
Live Nodes (http://192.168.18.83:50070/dfshealth.jsp#LiveNodes): 3
Dead Nodes (http://192.168.18.83:50070/dfshealth.jsp#DeadNodes): 0

Live Datanodes: 3

Node            Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
192.168.18.114  0             In Service   7.55       0         5.77            0
192.168.18.83   0             In Service   7.55       0         5.85            0
192.168.18.85   2             In Service   7.55       0         3.8             0

Could you explain to me why such a difference in capacity exists?

Thanks,
Victor Samoylov


Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-02 Thread James Moore
On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello,

 I'm trying to upload a fairly large file (18GB or so) to my AWS S3
 account via bin/hadoop fs -put ... s3://...

Isn't the maximum size of a file on s3 5GB?

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-02 Thread Ryan LeCompte
Actually not if you're using the s3:// as opposed to s3n:// ...

Thanks,
Ryan





Slaves Hot-Swaping

2008-09-02 Thread Camilo Gonzalez
Hi!

I was wondering if there is a way to Hot-Swap Slave machines, for example,
in case a Slave machine fails while the Cluster is running and I want to
mount a new Slave machine to replace the old one. Is there a way to tell the
Master that a new Slave machine is Online without having to stop and start
the Cluster again? I would appreciate the name of this; I don't think it is
named Hot-Swapping, I don't know even if this exists. lol

BTW, when I try to access http://wiki.apache.org/hadoop/NameNodeFailover the
site tells me that the page doesn't exist. Is it a broken link?

Any information is appreciated.

Thanks in advance

-- 
Camilo A. Gonzalez


Re: Slaves Hot-Swaping

2008-09-02 Thread Mikhail Yakshin
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote:
 I was wondering if there is a way to Hot-Swap Slave machines, for example,
 in case an Slave machine fails while the Cluster is running and I want to
 mount a new Slave machine to replace the old one, is there a way to tell the
 Master that a new Slave machine is Online without having to stop and start
 again the Cluster?

You don't have to restart the entire cluster, you just have to run the
datanode (DFS support) and/or tasktracker processes on the fresh node. You
can do it using hadoop-daemon.sh; the commands are 'start datanode' and
'start tasktracker' respectively. There's no need for hot swapping
and replacing old slave machines with new ones pretending to be the old
ones. You just plug the new one in with a new IP/hostname and it will
eventually start to do tasks like all the other nodes.

You don't really need any hot standby or any other high-availability
schemes. You just plug all possible slaves in and it will balance
everything out.

-- 
WBR, Mikhail Yakshin


Re: Hadoop EC2

2008-09-02 Thread Andrzej Bialecki

tim robertson wrote:


Incidentally, I have most of the basics of a MapReduce-Lite which I
aim to port to use the exact Hadoop API since I am *only* working on
10's-100's GB of data and find that it is running really fine on my
laptop and I don't need the distributed failover.  My goal for that


If it's going to be API-compatible with regular Hadoop, then I'm sure 
many people will find it useful. E.g. many Nutch users bemoan the 
complexity of distributed Hadoop setup, and they are not satisfied with 
the local single-threaded physical-copy execution mode.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Slaves Hot-Swaping

2008-09-02 Thread Allen Wittenauer



On 9/2/08 8:33 AM, Camilo Gonzalez [EMAIL PROTECTED] wrote:

 I was wondering if there is a way to Hot-Swap Slave machines, for example,
 in case an Slave machine fails while the Cluster is running and I want to
 mount a new Slave machine to replace the old one, is there a way to tell the
 Master that a new Slave machine is Online without having to stop and start
 again the Cluster? I would appreciate the name of this, I don't think it is
 named Hot-Swaping, I don't know even if this exists. Lol


:)

Using hadoop dfsadmin -refreshNodes, you can have the name node reload
the include and exclude files.



Output directory already exists

2008-09-02 Thread Shirley Cohen

Hi,

I'm trying to write the output of two different map-reduce jobs into  
the same output directory. I'm using MultipleOutputFormats to set the  
filename dynamically, so there is no filename collision between the  
two jobs. However, I'm getting the error output directory already  
exists.


Does the framework support this functionality? It seems silly to have  
to create a temp directory to store the output files from the second  
job and then have to copy them to the first job's output directory  
after the second job completes.


Thanks,

Shirley



Re: Hadoop EC2

2008-09-02 Thread Michael Stoppelman
Tom White's blog has a nice piece on the different setups you can have for a
hadoop cluster on EC2:
http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

With the EBS volumes you can bring up and take down your cluster at will so
you don't need to have 20 machines running all the time. We're still
collecting performance numbers, but it's definitely faster to use EBS or
local storage on EC2 than it is to use S3 (we were seeing 2Mb/s - 10Mb/s).

M





Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:

 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 I'm seeing much faster speeds.  With 128 nodes running a mapper-only
 downloading job, downloading 30 GB takes roughly a minute, less time than
 the end of job work (which I assume is HDFS replication and bookkeeping).
  More mappers gives you more parallel downloads, of course.  I'm using a
 Python REST client for S3, and only move data to or from S3 when Hadoop is
 done with it.

 Make sure your S3 buckets and EC2 instances are in the same zone.




Re: Hadoop EC2

2008-09-02 Thread Russell Smith
I assume that Karl means 'regions' - i.e. Europe or US. I don't think S3 
has the same premise of availability zones that EC2 has.


Between different regions, data transfer is 1) charged for and 2) likely 
slower between EC2 and S3-Europe.


Transfer between S3-US and EC2 is free of charge, and should be 
significantly quicker.



Russell



Re: HDFS space utilization

2008-09-02 Thread Raghu Angadi


Is there anything else on the partition DFS data directory is located?
IOW, what does 'df -k' show when you run it under 'dfs.data.dir'?

Raghu.






Distributed Hadoop available on an OpenSolaris-based Live CD

2008-09-02 Thread George Porter

Hello,

I'd like to announce the availability of an open-source “Live CD” aimed at
providing new users to Hadoop with a fully functional, pre-configured Hadoop
cluster that is easy to start up and use and lets people get a quick look at
what Hadoop offers in terms of power and ease of use. By lowering the barrier
to getting Hadoop up and running, more people can try it out and explore its
features.

The CD image provided gives users an environment emulating a fully
distributed, three-node virtual Hadoop cluster. One of the reasons we used
OpenSolaris is its ability to emulate a multinode cluster environment in a
very small memory footprint. A three-node Map/Reduce cluster can be brought
up on a machine with as little as 800 MB of memory. Each additional virtual
cluster node only requires about 40 MB of additional memory, in addition to
the memory used by Hadoop. This means that people can take Hadoop for a
spin, even on their laptop.

Since the CD is “live”, it does not modify the contents of the user's
computer. This makes it ideal for those wishing to try out Hadoop without
having to install any software. For example, students wishing to use Hadoop
in a classroom lab environment can work entirely off of the CD.

Included in this release is Hadoop 0.17.1 running on OpenSolaris. You can
join the OpenSolaris Hadoop community online, as well as download the CD
image, documentation, and other resources from
http://opensolaris.org/os/project/livehadoop/. If you have any requests or
suggestions for improvements to this distribution of Hadoop, please let us
know through the community site, or join the discussion by sending an email
to [EMAIL PROTECTED]


Thanks,
George Porter


Re: Output directory already exists

2008-09-02 Thread Mafish Liu

Map/Reduce will create the output directory every time it runs and will fail
if the directory already exists. It seems that there is no way to implement
what you describe other than modifying the source code.







-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Output directory already exists

2008-09-02 Thread Owen O'Malley
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen [EMAIL PROTECTED] wrote:

 Hi,

 I'm trying to write the output of two different map-reduce jobs into the
 same output directory. I'm using MultipleOutputFormats to set the filename
 dynamically, so there is no filename collision between the two jobs.
 However, I'm getting the error output directory already exists.


You just need to define a new OutputFormat that derives from the one that
you are really using for the second job. For example, if your second job is
using TextOutputFormat, you could derive a subtype whose checkOutputSpecs
always returns, even if the directory already exists. Something like:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;

public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    // Suffix the file name so the second job's output cannot collide.
    return super.getRecordWriter(ignored, job, name + "-second", progress);
  }
  // No-op: skip the "output directory already exists" check.
  public void checkOutputSpecs(FileSystem fs, JobConf conf) { }
}
{code}

-- Owen
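A driver sketch for wiring that in (an assumption for illustration, not from
the thread; the class name, job name, and paths are placeholders, and the
mapper/reducer settings are omitted):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SecondJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SecondJobDriver.class);
    conf.setJobName("second-job");
    // Skip the "output directory already exists" check for this job.
    conf.setOutputFormat(NoClobberTextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    // Point at the same directory the first job wrote to.
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
{code}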


Re: Distributed Hadoop available on an OpenSolaris-based Live CD

2008-09-02 Thread Alex Loddengaard
Another good idea would be to create a VM image with Hadoop installed and
configured for a single-node cluster.  Just throwing that out there.

Alex




Re: JVM Spawning

2008-09-02 Thread Ryan LeCompte
I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
starting and stopping JVMs. My input is broken up into 5 text files;
is there a way I could start the job such that it only uses 1 map to
process the whole thing? I guess I'd have to concatenate the files
into 1 file and somehow turn off splitting?

Ryan


On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

 On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote:

 Beginner's question:

 If I have a cluster with a single node that has a max of 1 map/1
 reduce, and the job submitted has 50 maps... Then it will process only
 1 map at a time. Does that mean that it's spawning 1 new JVM for each
 map processed? Or re-using the same JVM when a new map can be
 processed?

 It creates a new JVM for each task. Devaraj is working on
 https://issues.apache.org/jira/browse/HADOOP-249
 which will allow the jvms to run multiple tasks sequentially.

 -- Owen



Re: JVM Spawning

2008-09-02 Thread Owen O'Malley
On Tue, Sep 2, 2008 at 9:13 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 I see... so there really isn't a way for me to test a map/reduce
 program using a single node without incurring the overhead of
 upping/downing JVM's... My input is broken up into 5 text files is
 there a way I could start the job such that it only uses 1 map to
 process the whole thing? I guess I'd have to concatenate the files
 into 1 file and somehow turn off splitting?


There is a MultipleFileInputFormat, but it is less useful than it should be;
still, it is a good place to start. Defining a MultipleFileInputFormat that
reads text files should be pretty easy, and it will give you a single map for
your job. Otherwise, yes, you'll need to make a single file and ask for a
single map.

-- Owen
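On the "turn off splitting" part, a minimal sketch under the old mapred API
(my own illustration, not something suggested in the thread) is to subclass
TextInputFormat so that each file arrives as a single, unsplit map input;
concatenating the five files into one first would then give a single map
overall:

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

/** Each input file becomes exactly one split, i.e. one map task per file. */
public class NonSplittingTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // never split a file across map tasks
  }
}
{code}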


Re: JVM Spawning

2008-09-02 Thread Owen O'Malley
I posted an idea for an extension for MultipleFileInputFormat if someone has
any extra time. *smile*

https://issues.apache.org/jira/browse/HADOOP-4057

-- Owen