Re: Question about Hadoop
Thank you very much for explaining it to me, Ted. That's a great deal of info! I guess that could be how the Yahoo Webmap is designed. And for anyone trying to grasp the massiveness of Hadoop computing, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ should give a good picture of a practical case. I was for a moment flabbergasted, and instantly fell in love with Hadoop! ;)

On Sat, Jun 14, 2008 at 12:11 AM, Ted Dunning [EMAIL PROTECTED] wrote:

Usually Hadoop programs are not used interactively, since what they excel at is batch operations on very large collections of data. It is quite reasonable to store resulting data in Hadoop and access those results using Hadoop.

The cleanest way to do that is to have a presentation-layer web server that has all of the UI on it, and use HTTP to access the results file from Hadoop via the namenode's data-access URL. This works well where the results are not particularly voluminous.

For large quantities of data, such as the output of a web crawl, it is usually better to copy the output out of Hadoop and into a clustered system that supports high-speed querying of the data. This clustered system might be as simple as a redundant memcache or MySQL farm, or as fancy as a sharded and replicated farm of text-retrieval engines running under Solr. What works for you will vary by what you need to do.

You should keep in mind that Hadoop was designed for a very long MTBF (for a cluster), but not for zero-downtime operation. At the very least, you will occasionally want to upgrade the cluster software, and that currently can't be done during normal operations. Combining Hadoop (for heavy-duty computation) with a separate persistence layer (for high-availability web service) is a good hybrid.

On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James [EMAIL PROTECTED] wrote:

Thank you all for the responses.

So in order to run a web-based application, I just need to put the part of the application that needs distributed computation into HDFS, and have the other web-site files access it via Hadoop Streaming? Is that how Hadoop is used? Sorry if the question sounds too silly. Thank you.

-- ted
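Since the question mentions Hadoop Streaming: a streaming job is just a pair of programs that read lines on stdin and write tab-separated key/value lines on stdout, so the batch piece of a web application can be an ordinary script, with the web tier only reading the finished output directory as Ted describes. A minimal word-count sketch in Python (illustrative only; the file and option names in the comments are assumptions, not code from this thread):

```python
from itertools import groupby

def map_words(lines):
    # Streaming mapper: emit "word<TAB>1" for every word in the input.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_counts(pairs):
    # Streaming reducer: Hadoop delivers input sorted by key, so
    # consecutive identical words can be summed with groupby.
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(v) for _, v in group))

# Wired into a streaming job with something along these lines
# (jar path and flags depend on your Hadoop version):
#   bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
#     -input in/ -output out/ \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce"
```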
Failed Reduce Task
Hey everyone, I'm trying to get the hang of using Hadoop, and I'm following the Michael Noll Ubuntu tutorials (http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)). Using the wordcount example that comes with version 0.17.1-dev, I get this error output:

08/06/14 15:17:45 INFO mapred.FileInputFormat: Total input paths to process : 6
08/06/14 15:17:46 INFO mapred.JobClient: Running job: job_200806141506_0003
08/06/14 15:17:47 INFO mapred.JobClient: map 0% reduce 0%
08/06/14 15:17:53 INFO mapred.JobClient: map 12% reduce 0%
08/06/14 15:17:54 INFO mapred.JobClient: map 25% reduce 0%
08/06/14 15:17:55 INFO mapred.JobClient: map 37% reduce 0%
08/06/14 15:17:57 INFO mapred.JobClient: map 50% reduce 0%
08/06/14 15:17:58 INFO mapred.JobClient: map 75% reduce 0%
08/06/14 15:18:00 INFO mapred.JobClient: map 100% reduce 0%
08/06/14 15:18:03 INFO mapred.JobClient: map 100% reduce 1%
08/06/14 15:18:09 INFO mapred.JobClient: map 100% reduce 13%
08/06/14 15:18:16 INFO mapred.JobClient: map 100% reduce 18%
08/06/14 15:20:49 INFO mapred.JobClient: Task Id : task_200806141506_0003_m_01_0, Status : FAILED
Too many fetch-failures
08/06/14 15:20:51 INFO mapred.JobClient: map 87% reduce 18%
08/06/14 15:20:52 INFO mapred.JobClient: map 100% reduce 18%
08/06/14 15:20:56 INFO mapred.JobClient: map 100% reduce 19%
08/06/14 15:21:01 INFO mapred.JobClient: map 100% reduce 20%
08/06/14 15:21:05 INFO mapred.JobClient: map 100% reduce 16%
08/06/14 15:21:05 INFO mapred.JobClient: Task Id : task_200806141506_0003_r_01_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

This is with 2 nodes (master and slave) using the default values in /hadoop/conf/hadoop-default.xml, and then increasing the number of reduce tasks to 3 and 5 to see if that changed anything (it didn't). Has anybody seen this type of problem before, and how can I fix it? Thanks for any help. -Chanel
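For what it's worth, "Too many fetch-failures" together with "Exceeded MAX_FAILED_UNIQUE_FETCHES" on small multi-node setups is very often a hostname-resolution problem: the reduce task cannot resolve or reach the hostname the other node advertises for its map output. A common first check (an assumption about this setup, not a confirmed diagnosis; the addresses below are examples) is that both machines carry consistent /etc/hosts entries:

```
# /etc/hosts on BOTH master and slave (example addresses; adjust to your network)
192.168.0.1    master
192.168.0.2    slave
```

On Ubuntu it can also help to make sure the machine's own hostname is not mapped only to 127.0.1.1, since the tasktracker may then advertise an address the other node cannot reach.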
Guide to running Hadoop on Windows
I recently got started using Hadoop and spent some time getting distributed Hadoop running on Windows. I didn't find much on Google about running on Windows, and the docs are a tad vague on the subject. Anyway, for anyone who's interested, I've written up a guide based on my experience, called Running Hadoop on Windows, at http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/ Hopefully I haven't duplicated work someone has already done elsewhere, and hopefully this will help anyone out there wanting to get started with distributed Hadoop on Windows. If anyone finds any issues in the guide or has questions, please let me know or just comment on the blog post. Thanks. Hayes

-- Hayes Davis [EMAIL PROTECTED] http://hayesdavis.net
Ec2 and MR Job question
I have a question that someone may have answered here before, but I cannot find the answer. Assume I have a cluster of servers hosting a large amount of data, and I want to run a large job whose maps take a lot of CPU power but whose reduces take only a small amount. I want to run the maps on a group of EC2 servers and run the reduces on my local cluster of 10 machines. The problem I am seeing is with the map outputs: if I run the maps on EC2, they are stored locally on the instance. What I am looking to do is have the map output files stored in HDFS, so I can kill the EC2 instances, since I do not need them for the reduces. The only way I can think to do this is to run two jobs: one map-only job that stores its output in HDFS, and then a second job that runs the reduces from the map outputs stored in HDFS. Is there a way to make the mappers store their final output in HDFS?
Re: Ec2 and MR Job question
well, to answer your last question first, just set the # reducers to zero.

but you can't just run reducers without mappers (as far as I know, having never tried). so your local job will need to run identity mappers in order to feed your reducers.

http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

ckw

On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:

[...] Is there a way to make the mappers store the final output in HDFS?

-- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
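To make the data flow of the two-job pattern concrete, here is a pure-Python sketch (no Hadoop involved; all function names are made up for illustration): job 1 runs only mappers and persists their output, then job 2 feeds that stored output through identity mappers into the reducers.

```python
def expensive_map(record):
    # Job 1 mapper: the CPU-heavy part you'd run on EC2.
    return (record % 3, record * record)

def identity_map(pair):
    # Job 2 mapper: pass stored map output through unchanged.
    return pair

def reduce_sum(key, values):
    # Job 2 reducer: the cheap part you'd run on the local cluster.
    return (key, sum(values))

def run_two_jobs(records):
    # Job 1 is map-only (number of reduce tasks set to zero); its output
    # would land in HDFS, after which the EC2 instances can be killed.
    stored = [expensive_map(r) for r in records]
    # Job 2: identity mappers feed the shuffle, then reduce by key.
    shuffled = {}
    for key, value in (identity_map(p) for p in stored):
        shuffled.setdefault(key, []).append(value)
    return sorted(reduce_sum(k, vs) for k, vs in shuffled.items())
```

In real Hadoop, job 1 corresponds to JobConf.setNumReduceTasks(0) and job 2 uses the IdentityMapper class linked above.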
Re: Ec2 and MR Job question
I understand how to run it as two jobs. My only question is: is there a way to make the mappers store their final output in HDFS, so I can kill the EC2 machines without waiting for the reduce stage to end?

Billy

Chris K Wensel [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

well, to answer your last question first, just set the # reducers to zero. but you can't just run reducers without mappers (as far as I know, having never tried). so your local job will need to run identity mappers in order to feed your reducers. http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html ckw [...]
Re: Ec2 and MR Job question
My second question is about the EC2 machines: has anyone solved the hostname problem in an automated way? For example, if I launch an EC2 server to run a tasktracker, it reports back to my local cluster with its internal address, so with the default hostname the local reduce tasks cannot access the map files on the EC2 machine. I get this error:

WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: domU-12-31-39-00-A4-05.compute-1.internal

Is there an automated way to start a tasktracker on an EC2 machine so that it uses the public hostname, letting the local tasks fetch the maps from the EC2 machines? For example, something like

bin/hadoop-daemon.sh start tasktracker host=ec2-xx-xx-xx-xx.z-2.compute-1.amazonaws.com

that I could run to start just the tasktracker with the correct hostname. What I am trying to do is build a custom AMI image that I can launch whenever I need to add extra CPU power to my cluster, and have it start the tasktracker automatically via a shell script run at startup.

Billy

Billy Pearson [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

[...] Is there a way to make the mappers store the final output in HDFS?
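One automated approach (a sketch under stated assumptions, not a tested AMI recipe) is to have the instance's startup script ask EC2's instance metadata service for the public hostname and hand that to Hadoop. The metadata URL below is the standard EC2 endpoint; the delivery mechanism shown here, appending a slave.host.name property to conf/hadoop-site.xml before starting the tasktracker, is how the old EC2 contrib scripts handled it, but you should verify the property name against your Hadoop version.

```python
import urllib.request

# Standard EC2 instance metadata endpoint, reachable only from inside
# the instance itself.
METADATA_URL = "http://169.254.169.254/latest/meta-data/public-hostname"

def fetch_public_hostname(url=METADATA_URL):
    # Ask the metadata service which public hostname this instance has.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("ascii").strip()

def host_property_xml(hostname):
    # slave.host.name tells the tasktracker which hostname to advertise
    # to the jobtracker and to reduce tasks fetching map output.
    # (Assumption: your Hadoop version supports this property.)
    return (
        "<property>\n"
        "  <name>slave.host.name</name>\n"
        "  <value>%s</value>\n"
        "</property>" % hostname
    )

if __name__ == "__main__":
    # Print the snippet for the startup script to splice into
    # conf/hadoop-site.xml before running hadoop-daemon.sh.
    print(host_property_xml(fetch_public_hostname()))
```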