Re: Question about Hadoop

2008-06-14 Thread Chanchal James
Thank you very much for explaining it to me, Ted. That's a great deal of
info! I guess that could be how Yahoo's WebMap is designed.

And for anyone trying to grasp the scale of Hadoop computing,
http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
should give a good picture of a practical case. I was for a moment
flabbergasted, and instantly fell in love with Hadoop! ;)


On Sat, Jun 14, 2008 at 12:11 AM, Ted Dunning [EMAIL PROTECTED] wrote:

 Usually Hadoop programs are not used interactively, since what they excel
 at is batch operations on very large collections of data.

 It is quite reasonable to store resulting data in Hadoop and access those
 results using Hadoop.  The cleanest way to do that is to have a
 presentation-layer web server that hosts all of the UI and uses HTTP to
 access the results file from Hadoop via the namenode's data access URL.
 This works well where the results are not particularly voluminous.

 For large quantities of data, such as the output of a web crawl, it is
 usually better to copy the output out of Hadoop and into a clustered system
 that supports high-speed querying of the data.  This clustered system might
 be as simple as a redundant memcached or MySQL farm, or as fancy as a
 sharded and replicated farm of text-retrieval engines running under Solr.
 What works for you will vary by what you need to do.

 You should keep in mind that Hadoop was designed for a very long MTBF (for
 a cluster), but not for zero-downtime operation.  At the very least, you
 will occasionally want to upgrade the cluster software, and that currently
 can't be done during normal operations.  Combining Hadoop (for heavy-duty
 computations) with a separate persistence layer (for a high-availability
 web service) is a good hybrid.
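
 For example, a presentation layer could read a small results file straight
 out of HDFS with the Java FileSystem API instead of the namenode's HTTP
 interface. A minimal sketch; the namenode address and the results path
 below are assumptions, and it presumes the output is small enough to stream
 directly to the client:

// Minimal sketch: stream a small results file from HDFS to any OutputStream
// (e.g. an HTTP response). The namenode URI and the results path are assumed.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResultsFetcher {

    public static void writeResults(OutputStream out) throws IOException {
        Configuration conf = new Configuration();
        // Assumed namenode address; in practice this comes from hadoop-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        // Assumed location of the job's output file.
        InputStream in = fs.open(new Path("/results/part-00000"));
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
        }
    }
}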

 On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James [EMAIL PROTECTED]
 wrote:

  Thank you all for the responses.
 
  So in order to run a web-based application, I just need to put the part
  of the application that needs to make use of distributed computation in
  HDFS, and have the other website-related files access it via Hadoop
  streaming?
 
  Is that how Hadoop is used?
 
  Sorry if the question sounds too silly.
 
  Thank you.
 
 

 --
 ted



Failed Reduce Task

2008-06-14 Thread chanel

Hey everyone,

I'm trying to get the hang of using Hadoop, following the Michael Noll
Ubuntu tutorials
(http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)).
Using the wordcount example that comes with version 0.17.1-dev, I get
this error output:


08/06/14 15:17:45 INFO mapred.FileInputFormat: Total input paths to 
process : 6

08/06/14 15:17:46 INFO mapred.JobClient: Running job: job_200806141506_0003
08/06/14 15:17:47 INFO mapred.JobClient:  map 0% reduce 0%
08/06/14 15:17:53 INFO mapred.JobClient:  map 12% reduce 0%
08/06/14 15:17:54 INFO mapred.JobClient:  map 25% reduce 0%
08/06/14 15:17:55 INFO mapred.JobClient:  map 37% reduce 0%
08/06/14 15:17:57 INFO mapred.JobClient:  map 50% reduce 0%
08/06/14 15:17:58 INFO mapred.JobClient:  map 75% reduce 0%
08/06/14 15:18:00 INFO mapred.JobClient:  map 100% reduce 0%
08/06/14 15:18:03 INFO mapred.JobClient:  map 100% reduce 1%
08/06/14 15:18:09 INFO mapred.JobClient:  map 100% reduce 13%
08/06/14 15:18:16 INFO mapred.JobClient:  map 100% reduce 18%
08/06/14 15:20:49 INFO mapred.JobClient: Task Id : 
task_200806141506_0003_m_01_0, Status : FAILED

Too many fetch-failures
08/06/14 15:20:51 INFO mapred.JobClient:  map 87% reduce 18%
08/06/14 15:20:52 INFO mapred.JobClient:  map 100% reduce 18%
08/06/14 15:20:56 INFO mapred.JobClient:  map 100% reduce 19%
08/06/14 15:21:01 INFO mapred.JobClient:  map 100% reduce 20%
08/06/14 15:21:05 INFO mapred.JobClient:  map 100% reduce 16%
08/06/14 15:21:05 INFO mapred.JobClient: Task Id : 
task_200806141506_0003_r_01_0, Status : FAILED

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

This is with 2 nodes (master and slave), using the default values in
/hadoop/conf/hadoop-default.xml and then increasing the number of reduce
tasks to 3 and then 5 to see if that changed anything (it didn't). I'm
wondering if anybody has had this type of problem before and knows how to
fix it? Thanks for any help.


-Chanel





Guide to running Hadoop on Windows

2008-06-14 Thread Hayes Davis
I recently got started using Hadoop and spent some time getting distributed
Hadoop running on Windows. I didn't find much on Google about running on
Windows and the docs are a tad vague on the subject. Anyway, for anyone
that's interested, I've written up a guide based on my experience called
"Running Hadoop on Windows" at
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/

Hopefully I didn't duplicate work someone's done elsewhere and hopefully
this will help anyone out there wanting to get started with distributed
Hadoop on Windows.

If anyone finds any issues in the guide or has questions, please let me know
or just comment on the blog post.

Thanks.

Hayes

-- 
Hayes Davis
[EMAIL PROTECTED]
http://hayesdavis.net


Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
I have a question someone may have answered here before, but I cannot find
the answer.

Assuming I have a cluster of servers hosting a large amount of data, I want
to run a large job where the maps take a lot of CPU power to run and the
reduces only take a small amount of CPU. I want to run the maps on a group
of EC2 servers and run the reduces on the local cluster of 10 machines.

The problem I am seeing is with the map outputs: if I run the maps on EC2,
they are stored locally on the instance. What I am looking to do is have the
map output files stored in HDFS, so I can kill the EC2 instances since I do
not need them for the reduces.

The only way I can think to do this is to run two jobs: one mapper-only job
that stores its output in HDFS, and then a second job that runs the reduces
from the map outputs stored in HDFS.

Is there a way to make the mappers store their final output in HDFS?





Re: Ec2 and MR Job question

2008-06-14 Thread Chris K Wensel
Well, to answer your last question first: just set the number of reducers
to zero.

But you can't just run reducers without mappers (as far as I know, having
never tried), so your local job will need to run identity mappers in order
to feed your reducers.

http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
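
For what it's worth, a rough sketch of that two-job flow using the old
mapred API might look like the following (MyCpuHeavyMapper, MyReducer, the
paths, and the reducer count are placeholders, and this assumes the
0.17-era JobConf/FileInputFormat calls): the first job runs with zero
reducers so its map output lands in HDFS, and the second job feeds that
output through IdentityMapper to the real reducers on the local cluster.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoPhaseJob {
    public static void main(String[] args) throws Exception {
        // Job 1: map-only, runs on the EC2 task trackers. With zero reducers,
        // each map writes its output directly to the output path in HDFS.
        JobConf mapJob = new JobConf(TwoPhaseJob.class);
        mapJob.setJobName("cpu-heavy-maps");
        mapJob.setMapperClass(MyCpuHeavyMapper.class);   // placeholder mapper
        mapJob.setNumReduceTasks(0);                     // map output goes to HDFS
        mapJob.setOutputKeyClass(Text.class);
        mapJob.setOutputValueClass(Text.class);
        mapJob.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(mapJob, new Path("/input"));
        FileOutputFormat.setOutputPath(mapJob, new Path("/map-output"));
        JobClient.runJob(mapJob);

        // The EC2 instances can be terminated here; /map-output lives in HDFS.

        // Job 2: identity mappers feed the real reducers on the local cluster.
        JobConf reduceJob = new JobConf(TwoPhaseJob.class);
        reduceJob.setJobName("local-reduces");
        reduceJob.setInputFormat(SequenceFileInputFormat.class);
        reduceJob.setMapperClass(IdentityMapper.class);
        reduceJob.setReducerClass(MyReducer.class);      // placeholder reducer
        reduceJob.setNumReduceTasks(10);
        reduceJob.setOutputKeyClass(Text.class);
        reduceJob.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(reduceJob, new Path("/map-output"));
        FileOutputFormat.setOutputPath(reduceJob, new Path("/final-output"));
        JobClient.runJob(reduceJob);
    }
}

The second job simply re-reads the stored map output through identity
mappers before the shuffle, so nothing in it depends on the EC2 machines
still being alive.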

ckw

On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:

I have a question someone may have answered here before, but I cannot
find the answer.

Assuming I have a cluster of servers hosting a large amount of data, I
want to run a large job where the maps take a lot of CPU power to run and
the reduces only take a small amount of CPU. I want to run the maps on a
group of EC2 servers and run the reduces on the local cluster of 10
machines.

The problem I am seeing is with the map outputs: if I run the maps on EC2,
they are stored locally on the instance. What I am looking to do is have
the map output files stored in HDFS, so I can kill the EC2 instances since
I do not need them for the reduces.

The only way I can think to do this is to run two jobs: one mapper-only
job that stores its output in HDFS, and then a second job that runs the
reduces from the map outputs stored in HDFS.

Is there a way to make the mappers store their final output in HDFS?



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson

I understand how to run it as two jobs; my only question is: is there a
way to make the mappers store their final output in HDFS, so I can kill
the EC2 machines without waiting for the reduce stage to end?

Billy



Chris K Wensel [EMAIL PROTECTED] wrote in
message news:[EMAIL PROTECTED]
Well, to answer your last question first: just set the number of reducers
to zero.

But you can't just run reducers without mappers (as far as I know, having
never tried), so your local job will need to run identity mappers in order
to feed your reducers.

http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

ckw












Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
My second question is about the EC2 machines: has anyone solved the
hostname problem in an automated way?

For example, if I launch an EC2 server to run a tasktracker, the hostname
reported back to my local cluster is its internal address, and the local
reduce tasks cannot access the map files on the EC2 machine with that
default hostname.

I get an error:
WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: 
domU-12-31-39-00-A4-05.compute-1.internal


Question: is there an automated way to start a tasktracker on an EC2
machine using the public hostname, so the local reduce tasks can fetch the
map outputs from the EC2 machines?

For example, something like

bin/hadoop-daemon.sh start tasktracker host=ec2-xx-xx-xx-xx.z-2.compute-1.amazonaws.com

that I can run to start just the tasktracker with the correct hostname.

What I am trying to do is build a custom AMI image that I can launch
whenever I need to add extra CPU power to my cluster, and have it
automatically start the tasktracker via a shell script that runs at
startup.

Billy

