configuration basics
On Fri, Jun 27, 2008 at 9:15 AM, Rick Cox [EMAIL PROTECTED] wrote:
> Yes, mapred.tasktracker.map.tasks.maximum is configured per tasktracker on
> startup. It can't be configured per job because it's not a job-scope
> parameter (if there are multiple concurrent jobs, they have to share the
> task limit).
>
> rick

Is there a good way to discover which parameters can be configured on a
per-job basis, vs. a per-tasktracker or per-site basis? E.g. I'd like to
change my dfs.replication.min when I add new nodes to my cluster, if
possible. Restarting dfs leaves me with namenodeId mismatches (and that's
not good), so that's not really an option.

Thanks!

Chris

--
Chris Anderson
http://jchris.mfdz.com
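(A note on the replication example: dfs.replication.min itself is a
namenode-side setting, but the related dfs.replication is read client-side
at file-creation time, so it can be set per job or adjusted for existing
files without restarting dfs; the per-parameter descriptions in
conf/hadoop-default.xml are the closest thing to documentation on scope.
A sketch only, assuming the job uses ToolRunner/GenericOptionsParser, with
placeholder jar and path names:)

  # set the replication factor for files written by one job
  hadoop jar my-job.jar -D dfs.replication=3 ...
  # raise the replication of files that already exist (no dfs restart needed)
  hadoop dfs -setrep -R 3 /user/chris/data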
Re: process limits for streaming jar
Having experimented some more, I've found that the simple solution is to
limit the resource usage by limiting the # of map tasks and the memory they
are allowed to consume. I'm specifying the constraints on the command line
like this:

  -jobconf mapred.tasktracker.map.tasks.maximum=2 \
  -jobconf mapred.child.ulimit=1048576

The configuration parameters seem to take; in the job.xml available from the
web console, I see these lines:

  mapred.child.ulimit                    1048576
  mapred.tasktracker.map.tasks.maximum   2

The problem is that when there are a large number of map tasks to complete,
Hadoop doesn't seem to obey map.tasks.maximum. Instead, it is spawning 8 map
tasks per tasktracker (even when I change mapred.tasktracker.map.tasks.maximum
in hadoop-site.xml to 2, on the master). The cluster was booted with the
setting at 8.

Do I need to change hadoop-site.xml on all the slaves, and restart the
tasktrackers, in order to make the limit apply? That seems unlikely - I'd
really like to manage this parameter on a per-job basis.

Thanks for any input!

Chris

--
Chris Anderson
http://jchris.mfdz.com
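(For reference: as noted in the other thread, mapred.tasktracker.map.tasks.maximum
is read by each tasktracker at startup, so the per-job -jobconf value shows up
in job.xml but is ignored. A minimal sketch of pushing the setting to every
slave and bouncing the MapReduce daemons, assuming the stock stop-mapred.sh /
start-mapred.sh scripts, the install path from the EC2 AMI, and a conf/slaves
file that lists the slave hostnames:)

  # on the master: copy the updated hadoop-site.xml to every slave
  for slave in $(cat /usr/local/hadoop-0.17.0/conf/slaves); do
    scp /usr/local/hadoop-0.17.0/conf/hadoop-site.xml \
        "$slave":/usr/local/hadoop-0.17.0/conf/
  done
  # restart the jobtracker and all tasktrackers so they re-read the limit
  /usr/local/hadoop-0.17.0/bin/stop-mapred.sh
  /usr/local/hadoop-0.17.0/bin/start-mapred.sh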
process limits for streaming jar
Hi there,

I'm running some streaming jobs on ec2 (ruby parsing scripts), and in my
most recent test I managed to spike the load on my large instances to 25 or
so. As a result, I lost communication with one instance. I think I took down
sshd. Whoops.

My question is: has anyone got strategies for managing the resources used by
the processes spawned by streaming jar? Ideally I'd like to run my ruby
scripts under nice. I can hack something together with wrappers, but I'm
thinking there might be a configuration option to handle this within
streaming jar.

Thanks for any suggestions!

--
Chris Anderson
http://jchris.mfdz.com
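(The wrapper approach is straightforward if no built-in option turns up.
A minimal sketch, assuming a ruby mapper named map.rb; the wrapper name and
nice level are illustrative:)

  #!/bin/sh
  # run_nice.sh - launch the real mapper at the lowest CPU priority
  exec nice -n 19 ruby map.rb "$@"

(Both files would then be shipped to the tasktrackers with streaming's -file
option and the wrapper named as the mapper, e.g.
-mapper run_nice.sh -file run_nice.sh -file map.rb.)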
Re: realtime hadoop
Vadim,

Depending on the nature of your data, CouchDB (http://couchdb.org) might be
worth looking into. It speaks JSON natively, and has real-time map/reduce
support. The 0.8.0 release is imminent (don't bother with 0.7.2), and the
community is active. We're using it for something similar to what you
describe, and it's working well.

Chris

--
Chris Anderson
http://jchris.mfdz.com
contrib EC2 with hadoop 0.17
First of all, thanks to whoever maintains the hadoop-ec2 scripts. They've
saved us untold time and frustration getting started with a small testing
cluster (5 instances).

A question: when we log into the newly created cluster and run jobs from the
example jar (pi, etc.), everything works great. We expect our custom jobs
will run just as smoothly. However, when we restart the namenodes and
tasktrackers by running bin/stop-all.sh on the master, it tries to stop
activity only on localhost. Running start-all.sh then boots up a
localhost-only cluster (on which jobs run just fine).

The only way we've been able to recover from this situation is to use
bin/terminate-hadoop-cluster and bin/destroy-hadoop-cluster and then start
again from scratch with a new cluster. There must be a simple way to restart
the namenodes and jobtrackers across all machines from the master. Also, I
think understanding the answer to this question might put a lot more into
perspective for me, so I can go on to do more advanced things on my own.

Thanks for any assistance / insight!

Chris

output from stop-all.sh
==
stopping jobtracker
localhost: Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
localhost: no tasktracker to stop
stopping namenode
localhost: no datanode to stop
localhost: no secondarynamenode to stop

conf files in /usr/local/hadoop-0.17.0
==
# cat conf/slaves
localhost
# cat conf/masters
localhost

--
Chris Anderson
http://jchris.mfdz.com
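(The clue is in the conf dump above: both conf/slaves and conf/masters
contain only localhost, so stop-all.sh and start-all.sh never ssh out to the
other instances. A sketch of the usual fix, assuming the master can ssh to
the slaves and using placeholder internal EC2 hostnames; the real names come
from the launched instances:)

  # on the master: list every slave's internal hostname, one per line
  printf '%s\n' \
      domU-12-31-39-00-AA-BB.compute-1.internal \
      domU-12-31-39-00-CC-DD.compute-1.internal \
      > /usr/local/hadoop-0.17.0/conf/slaves
  # conf/masters holds the secondary namenode host; the master itself is fine
  echo localhost > /usr/local/hadoop-0.17.0/conf/masters
  # stop-all.sh / start-all.sh will now ssh to every host in conf/slaves
  /usr/local/hadoop-0.17.0/bin/stop-all.sh
  /usr/local/hadoop-0.17.0/bin/start-all.sh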
Re: hadoop on EC2
Andreas,

If you can ssh into the nodes, you can always set up port-forwarding with
ssh -L to bring those ports to your local machine.

On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
> What I wonder is what ports do I need to access? 50060 on all nodes. 50030
> on the jobtracker. Any other ports?
>
> Andreas
>
> On Wednesday, 2008-05-28 at 13:37 -0700, Allen Wittenauer wrote:
>> On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
>>> I just wondered what other people use to access the hadoop webservers,
>>> when running on EC2?
>>
>> While we don't run on EC2 :), we do protect the hadoop web processes by
>> putting a proxy in front of it. A user connects to the proxy,
>> authenticates, and then gets the output from the hadoop process. All of
>> the redirection magic happens via a localhost connection, so no data is
>> leaked unprotected.

--
Chris Anderson
http://jchris.mfdz.com
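(A minimal forwarding sketch for the ports mentioned above, the jobtracker UI
on 50030 and a tasktracker UI on 50060; the keypair path and EC2 hostnames
are placeholders:)

  # jobtracker UI becomes http://localhost:50030
  ssh -i ~/.ssh/my-ec2-key -L 50030:localhost:50030 \
      root@ec2-master.compute-1.amazonaws.com
  # one slave's tasktracker UI becomes http://localhost:50060
  ssh -i ~/.ssh/my-ec2-key -L 50060:localhost:50060 \
      root@ec2-slave-1.compute-1.amazonaws.com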
Re: hadoop on EC2
On Wed, May 28, 2008 at 2:23 PM, Ted Dunning [EMAIL PROTECTED] wrote:
> That doesn't work because the various web pages have links or redirects to
> other pages on other machines. Also, you would need to ssh to ALL of your
> cluster to get the file browser to work.

True. That makes it a little impractical. Better to do the proxy thing.

This would be a nice addition to the Hadoop EC2 AMI (which is super helpful,
by the way). Thanks to whoever put it together.

--
Chris Anderson
http://jchris.mfdz.com
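(One way to approximate "the proxy thing" without standing up a dedicated
proxy is a SOCKS tunnel over ssh, so that links redirecting to other cluster
machines are resolved from inside EC2. A sketch, with placeholder keypair and
hostname, assuming the browser is then pointed at the SOCKS proxy with remote
DNS enabled:)

  # dynamic (SOCKS) forward on local port 1080 through the master
  ssh -i ~/.ssh/my-ec2-key -D 1080 root@ec2-master.compute-1.amazonaws.com
  # then set the browser's SOCKS proxy to localhost:1080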