The jobs will be of different sizes, and with only 5 machines some may take days to complete, so yes, some will run night and day.
By scale back, I mean scale back on system resources (CPU, IO, RAM) so the
machines can be used for other tasks during the day. I understand (as you
pointed out) that I can reduce the resources Hadoop uses by editing
hadoop-env.sh and hadoop-site.xml, but only at startup; there is no way to do
this on the fly, so to speak. Is that correct?

I think ideally a way to suspend and resume a job would be preferable to
scaling back on resources, i.e. write current progress to disk in the morning,
suspend processing, and then start up again where it left off at night.

Cheers,
John

2009/5/19 Kevin Weil <kevinw...@gmail.com>

> Will your jobs be running night and day, or just over a specified period?
> Depending on your setup, and on what you mean by "scale back" (CPU vs disk
> IO vs memory), you could potentially restart your cluster with different
> settings at different times of the day via cron. This will kill any running
> jobs, so it'll only work if you can find or create a few free minutes. But
> then you could scale back on CPU by running with HADOOP_NICENESS nonzero
> (see conf/hadoop-env.sh), you could scale back on memory by setting the
> various process memory limits low in conf/hadoop-site.xml, and you could
> scale back on datanode work entirely by setting the maximum number of
> mappers or reducers to 1 per node during the day (also in
> conf/hadoop-site.xml).
>
> Kevin
>
> On Tue, May 19, 2009 at 7:23 AM, Steve Loughran <ste...@apache.org> wrote:
>
> > John Clarke wrote:
> >
> >> Hi,
> >>
> >> I am working on a project that is suited to Hadoop and so want to create
> >> a small cluster (only 5 machines!) on our servers. The servers are
> >> however used during the day and (mostly) idle at night.
> >>
> >> So, I want Hadoop to run at full throttle at night and either scale back
> >> or suspend itself during certain times.
> >
> > You could add/remove new task trackers on idle systems, but
> > * you don't want to take away datanodes, as there's a risk that data
> > will become unavailable.
> > * there's nothing in the scheduler to warn that machines will go away
> > at a certain time
> > If you only want to run the cluster at night, I'd just configure the
> > entire cluster to go up and down
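
PS: if I've understood Kevin right, the daytime throttle would look roughly
like the sketch below. This is untested, and I'm going from the 0.18/0.19
docs, so the property names and paths may need checking for other versions:

```xml
<!-- conf.day/hadoop-site.xml: throttled daytime settings (sketch, untested) -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>    <!-- at most one mapper per node during the day -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>    <!-- at most one reducer per node during the day -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>    <!-- keep task JVM heaps small -->
  </property>
</configuration>
```

plus `export HADOOP_NICENESS=10` in conf.day/hadoop-env.sh, and a cron pair to
swap configs at the day/night boundaries, something like:

```sh
# Sketch: restart MapReduce with the throttled config at 8am and the full
# config at 8pm. Assumes two config dirs (conf.day, conf.night) and an
# install under /opt/hadoop -- both are illustrative, not real paths here.
0 8  * * * HADOOP_CONF_DIR=/opt/hadoop/conf.night /opt/hadoop/bin/stop-mapred.sh && HADOOP_CONF_DIR=/opt/hadoop/conf.day /opt/hadoop/bin/start-mapred.sh
0 20 * * * HADOOP_CONF_DIR=/opt/hadoop/conf.day /opt/hadoop/bin/stop-mapred.sh && HADOOP_CONF_DIR=/opt/hadoop/conf.night /opt/hadoop/bin/start-mapred.sh
```

As Kevin notes, each restart kills running jobs, so this only touches
MapReduce (stop-mapred.sh/start-mapred.sh) and leaves HDFS alone, which also
sidesteps Steve's point about not taking datanodes away.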