Re: Hadoop EC2
Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Re: Hadoop EC2
I have been processing only 100s of GBs on EC2, not 1000s, using 20 nodes, and really only in the exploration and testing phase right now.
Reading and writing Thrift data from MapReduce
We are already using Thrift to move and store our log data, and I'm looking into how I could read the stored log data into MapReduce processes. This article http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html talks about using Thrift for the IO, but it doesn't say anything specific. What's the current status of Thrift with Hadoop? Is there any documentation online, or even some code in the SVN which I could look into? - Juho Mäkinen
Re: Hadoop EC2
Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan
Re: Hadoop EC2
Hi Ryan, I actually blogged my experience, as it was my first usage of EC2: http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

My input data was not log files but actually a dump of 150 million records from MySQL into about 13 columns of tab file data, I believe. It was a couple of months ago, but I remember thinking S3 was very slow... I ran some simple operations like distinct values of one column based on another (species within a cell) and also did some polygon analysis, since doing "is this point in this polygon" does not really scale too well in PostGIS.

Incidentally, I have most of the basics of a MapReduce-Lite which I aim to port to use the exact Hadoop API, since I am *only* working on 10s-100s of GB of data, find that it runs really fine on my laptop, and don't need the distributed failover. My goal for that code is for people like me who want to know that they can scale to terabyte processing, but don't need to take the plunge to a full Hadoop deployment yet, and will know that they can migrate the processing in the future as things grow. It runs on the normal filesystem, and single node only (i.e. multithreaded), and performs very quickly since it is just doing Java NIO ByteBuffers in parallel on the underlying filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a second (simplest of simple map operations). For these small datasets, you might find it useful - let me know if I should spend time finishing it (or submit help?) - it is really very simple. Cheers Tim
Re: Hadoop EC2
Hi Tim, Thanks for responding -- I believe that I'll need the full power of Hadoop since I'll want this to scale well beyond 100GB of data. Thanks for sharing your experiences -- I'll definitely check out your blog. Thanks! Ryan
Re: Reading and writing Thrift data from MapReduce
On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen [EMAIL PROTECTED] wrote: What's the current status of Thrift with Hadoop? Is there any documentation online or even some code in the SVN which I could look into? I think you have two choices: 1) wrap your Thrift code in a class that implements Writable, or 2) use Thrift to serialize your data to byte arrays and store them as BytesWritable. -Stuart
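For example, option 1 could look roughly like the sketch below. This is only a minimal, untested illustration: LogEntry stands in for whichever Thrift-generated class you already use for your log records, the wrapper name is made up, and it assumes Thrift's TSerializer/TDeserializer helpers with the binary protocol. Option 2 is the same idea, except you would emit the serialized byte[] wrapped in a BytesWritable instead of defining a new Writable.

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

// Hypothetical wrapper: makes a Thrift-generated LogEntry usable as a Hadoop key/value type.
public class LogEntryWritable implements Writable {
  private final LogEntry entry = new LogEntry();  // your own Thrift-generated class
  private final TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
  private final TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());

  public LogEntry get() {
    return entry;
  }

  public void write(DataOutput out) throws IOException {
    try {
      byte[] bytes = serializer.serialize(entry);  // Thrift struct -> bytes
      out.writeInt(bytes.length);
      out.write(bytes);
    } catch (TException e) {
      throw new IOException("Thrift serialization failed: " + e);
    }
  }

  public void readFields(DataInput in) throws IOException {
    try {
      byte[] bytes = new byte[in.readInt()];
      in.readFully(bytes);
      deserializer.deserialize(entry, bytes);  // bytes -> Thrift struct
    } catch (TException e) {
      throw new IOException("Thrift deserialization failed: " + e);
    }
  }
}
{code}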
HDFS space utilization
Hi, I ran 3 data-nodes as HDFS and saw that at the beginning (no files in HDFS) I have only 15 GB instead of 22 GB; see the following live status of my nodes:

10 files and directories, 0 blocks = 10 total. Heap Size is 5.21 MB / 992.31 MB (0%)
Capacity: 22.64 GB
DFS Remaining: 15.42 GB
DFS Used: 72 KB
DFS Used%: 0%
Live Nodes: 3
Dead Nodes: 0

Live Datanodes: 3
Node            Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
192.168.18.114  0             In Service   7.55       0         5.77            0
192.168.18.83   0             In Service   7.55       0         5.85            0
192.168.18.85   2             In Service   7.55       0         3.8             0

Could you explain why such a difference in capacity exists? Thanks, Victor Samoylov
Re: Error while uploading large file to S3 via Hadoop 0.18
On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm trying to upload a fairly large file (18GB or so) to my AWS S3 account via bin/hadoop fs -put ... s3://... Isn't the maximum size of a file on s3 5GB? -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Error while uploading large file to S3 via Hadoop 0.18
Actually not if you're using s3:// (the block-based S3 filesystem, which stores files as blocks rather than single S3 objects) as opposed to s3n:// ... Thanks, Ryan

On Tue, Sep 2, 2008 at 11:21 AM, James Moore [EMAIL PROTECTED] wrote: Isn't the maximum size of a file on s3 5GB?
Slaves Hot-Swapping
Hi! I was wondering if there is a way to Hot-Swap Slave machines, for example, in case a Slave machine fails while the Cluster is running and I want to mount a new Slave machine to replace the old one, is there a way to tell the Master that a new Slave machine is Online without having to stop and start the Cluster again? I would appreciate the name of this, I don't think it is named Hot-Swapping, I don't even know if this exists. lol BTW, when I try to access http://wiki.apache.org/hadoop/NameNodeFailover the site tells me that the page doesn't exist. Is it a broken link? Any information is appreciated. Thanks in advance -- Camilo A. Gonzalez
Re: Slaves Hot-Swapping
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote: I was wondering if there is a way to Hot-Swap Slave machines, for example, in case a Slave machine fails while the Cluster is running and I want to mount a new Slave machine to replace the old one, is there a way to tell the Master that a new Slave machine is Online without having to stop and start again the Cluster?

You don't have to restart the entire cluster, you just have to run the datanode (DFS support) and/or tasktracker processes on the fresh node. You can do this using hadoop-daemon.sh; the commands are "hadoop-daemon.sh start datanode" and "hadoop-daemon.sh start tasktracker" respectively. There's no need for hot swapping and replacing old slave machines with new ones pretending to be the old ones. You just plug a new one in with a new IP/hostname and it will eventually start to do tasks like all the other nodes. You don't really need any hot standby or other high-availability schemes. You just plug all possible slaves in and it will balance everything out. -- WBR, Mikhail Yakshin
Re: Hadoop EC2
tim robertson wrote: Incidentally, I have most of the basics of a MapReduce-Lite which I aim to port to use the exact Hadoop API since I am *only* working on 10's-100's GB of data and find that it is running really fine on my laptop and I don't need the distributed failover. My goal for that

If it's going to be API-compatible with regular Hadoop, then I'm sure many people will find it useful. E.g. many Nutch users bemoan the complexity of distributed Hadoop setup, and they are not satisfied with the local single-threaded physical-copy execution mode. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: Slaves Hot-Swapping
On 9/2/08 8:33 AM, Camilo Gonzalez [EMAIL PROTECTED] wrote: I was wondering if there is a way to Hot-Swap Slave machines, for example, in case a Slave machine fails while the Cluster is running and I want to mount a new Slave machine to replace the old one, is there a way to tell the Master that a new Slave machine is Online without having to stop and start again the Cluster? I would appreciate the name of this, I don't think it is named Hot-Swapping, I don't even know if this exists. Lol :)

Using hadoop dfsadmin -refreshNodes, you can have the name node reload the include and exclude files.
Output directory already exists
Hi, I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error "output directory already exists". Does the framework support this functionality? It seems silly to have to create a temp directory to store the output files from the second job and then have to copy them to the first job's output directory after the second job completes. Thanks, Shirley
Re: Hadoop EC2
Tom White's blog has a nice piece on the different setups you can have for a Hadoop cluster on EC2: http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html With EBS volumes you can bring up and take down your cluster at will, so you don't need to have 20 machines running all the time. We're still collecting performance numbers, but it's definitely faster to use EBS or local storage on EC2 than it is to use S3 (we were seeing 2Mb/s - 10Mb/s). M
Re: Hadoop EC2
How can you ensure that the S3 buckets and EC2 instances belong to a certain zone? Ryan

On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote: I'm seeing much faster speeds. With 128 nodes running a mapper-only downloading job, downloading 30 GB takes roughly a minute, less time than the end-of-job work (which I assume is HDFS replication and bookkeeping). More mappers give you more parallel downloads, of course. I'm using a Python REST client for S3, and only move data to or from S3 when Hadoop is done with it. Make sure your S3 buckets and EC2 instances are in the same zone.
Re: Hadoop EC2
I assume that Karl means 'regions' - i.e. Europe or US. I don't think S3 has the same notion of availability zones that EC2 has. Between different regions, data transfer is 1) charged for and 2) likely slower, e.g. between EC2 and S3-Europe. Transfer between S3-US and EC2 is free of charge, and should be significantly quicker. Russell

Ryan LeCompte wrote: How can you ensure that the S3 buckets and EC2 instances belong to a certain zone?
Re: HDFS space utilization
Is there anything else on the partition where the DFS data directory is located? IOW, what does 'df -k' show when you run it under 'dfs.data.dir'? Raghu.
Distributed Hadoop available on an OpenSolaris-based Live CD
Hello, I'd like to announce the availability of an open-source “Live CD” aimed at providing new users of Hadoop with a fully functional, pre-configured Hadoop cluster that is easy to start up and use and lets people get a quick look at what Hadoop offers in terms of power and ease of use. By lowering the barrier to getting Hadoop up and running, more people can try it out and explore its features.

The CD image provided gives users an environment emulating a fully distributed, three-node virtual Hadoop cluster. One of the reasons we used OpenSolaris is its ability to emulate a multinode cluster environment in a very small memory footprint. A three-node Map/Reduce cluster can be brought up on a machine with as little as 800 MB of memory. Each additional virtual cluster node only requires about 40 MB of additional memory, in addition to the memory used by Hadoop. This means that people can take Hadoop for a spin, even on their laptop.

Since the CD is “live”, it does not modify the contents of the user's computer. This makes it ideal for those wishing to try out Hadoop without having to install any software. For example, students wishing to use Hadoop in a classroom lab environment can work entirely off of the CD.

Included in this release is Hadoop 0.17.1 running on OpenSolaris. You can join the OpenSolaris Hadoop community online, as well as download the CD image, documentation, and other resources from http://opensolaris.org/os/project/livehadoop/ If you have any requests or suggestions for improvements to this distribution of Hadoop, please let us know through the community site, or join the discussion by sending an email to [EMAIL PROTECTED] Thanks, George Porter
Re: Output directory already exists
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error "output directory already exists". Does the framework support this functionality?

Map/Reduce creates the output directory every time it runs and will fail if the directory already exists. It seems there is no way to implement what you describe other than modifying the source code. -- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
Re: Output directory already exists
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error "output directory already exists".

You just need to define a new OutputFormat that derives from the one that you are really using for the second job. For example, if your second job is using TextOutputFormat, you could derive a subtype and have checkOutputSpecs always return successfully, even if the directory already exists. Something like:

{code}
public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    // Suffix the filename so the second job's files don't collide with the first job's.
    return super.getRecordWriter(ignored, job, name + "-second", progress);
  }

  public void checkOutputSpecs(FileSystem fs, JobConf conf) {
    // Deliberately empty: don't fail when the output directory already exists.
  }
}
{code}

-- Owen
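The second job's JobConf would then point at the new format. A rough, untested sketch (the driver class and output path below are placeholders, not anything from the thread):

{code}
JobConf conf = new JobConf(SecondJob.class);             // hypothetical driver class
conf.setOutputFormat(NoClobberTextOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/shared/output"));  // same directory as the first job
JobClient.runJob(conf);
{code}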
Re: Distributed Hadoop available on an OpenSolaris-based Live CD
Another good idea would be to create a VM image with Hadoop installed and configured for a single-node cluster. Just throwing that out there. Alex
Re: JVM Spawning
I see... so there really isn't a way for me to test a map/reduce program using a single node without incurring the overhead of starting/stopping JVMs... My input is broken up into 5 text files. Is there a way I could start the job such that it only uses 1 map to process the whole thing? I guess I'd have to concatenate the files into 1 file and somehow turn off splitting? Ryan

On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote: Beginner's question: If I have a cluster with a single node that has a max of 1 map/1 reduce, and the job submitted has 50 maps... Then it will process only 1 map at a time. Does that mean that it's spawning 1 new JVM for each map processed? Or re-using the same JVM when a new map can be processed? It creates a new JVM for each task. Devaraj is working on https://issues.apache.org/jira/browse/HADOOP-249 which will allow the JVMs to run multiple tasks sequentially. -- Owen
Re: JVM Spawning
On Tue, Sep 2, 2008 at 9:13 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: My input is broken up into 5 text files. Is there a way I could start the job such that it only uses 1 map to process the whole thing? I guess I'd have to concatenate the files into 1 file and somehow turn off splitting?

There is a MultipleFileInputFormat, but it is less useful than it should be; still, it is a good place to start. Defining a MultipleFileInputFormat that reads text files should be pretty easy, and it will give you a single map for your job. Otherwise, yes, you'll need to make a single file and ask for a single map. -- Owen
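If you go the single-file route, the "turn off splitting" half can be done with a tiny InputFormat subclass. A rough sketch against the old mapred API used elsewhere in this thread - the class name is made up, and note this only prevents a file from being split, so with 5 input files you would still get 5 maps unless you concatenate them into one file first:

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical: a TextInputFormat that never splits its files,
// so each input file is handled by exactly one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}
{code}

Then in the JobConf: conf.setInputFormat(NonSplittableTextInputFormat.class);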
Re: JVM Spawning
I posted an idea for an extension for MultipleFileInputFormat if someone has any extra time. *smile* https://issues.apache.org/jira/browse/HADOOP-4057 -- Owen