Andrzej Bialecki wrote:
> Example: what happens now if you try to run more than one fetcher at the same time, where the fetcher parameters differ (or the set of activated plugins differs)? You can't - the local tasks on each tasktracker will use whatever local config is there.
That's true when mapred.job.tracker=local, but when things are distributed the config can vary, since each task is spawned in a separate JVM with a separate classpath. Only properties set in nutch-site.xml on each node can never be overridden. For example, so long as plugin.includes is not specified in nutch-site.xml on each node, each task can override plugin.includes to use different plugins.
Also note that plugin implementations can be submitted in a jar file with the job, and plugin.folders can be overridden in the job to find the new plugins. So a job jar might include a folder named "my.plugins", set plugin.folders to "my.plugins, plugins", and then alter plugin.includes to activate job-specific plugins.
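For instance, a rough sketch of such a job setup (treat the exact JobConf/JobClient signatures as approximate, and "my-custom-plugin" is a made-up plugin name):

  import org.apache.nutch.mapred.JobClient;
  import org.apache.nutch.mapred.JobConf;
  import org.apache.nutch.util.NutchConf;

  public class CustomPluginJob {
    public static void main(String[] args) throws Exception {
      // Assumes neither plugin.folders nor plugin.includes is pinned
      // in nutch-site.xml on the tasktracker nodes.
      JobConf job = new JobConf(NutchConf.get());

      // "my.plugins" is a folder inside the submitted job jar;
      // "plugins" is the standard folder in each node's Nutch install.
      job.set("plugin.folders", "my.plugins,plugins");

      // Activate the job-specific plugin (hypothetical name) alongside
      // the usual protocol and parse plugins.
      job.set("plugin.includes",
          "protocol-http|parse-(text|html)|my-custom-plugin");

      // ... set input/output dirs, mapper, reducer, etc. ...
      JobClient.runJob(job);
    }
  }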
> What happens if you change the config on a node that submits the job? The changes won't be propagated to the tasktracker nodes, because tasktrackers use local configuration (through a singleton NutchConf.get()), instead of supplying a serialized/deserialized instance of the config from the originating node... etc.
Again, I'm not sure this is a problem. Properties that tasks should be able to override should not be specified in nutch-site.xml, but rather in mapred-default.xml: per-job settings override mapred-default.xml, while nutch-site.xml always wins. Lots of job-specific properties are currently passed this way.
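To illustrate the precedence (fetcher.threads.fetch is just a convenient example property, and the resource list in the comments is how I understand NutchConf assembles its config, give or take details):

  import org.apache.nutch.mapred.JobConf;
  import org.apache.nutch.util.NutchConf;

  public class PrecedenceDemo {
    public static void main(String[] args) {
      // NutchConf reads nutch-default.xml, mapred-default.xml and
      // nutch-site.xml from the classpath, with nutch-site.xml last.
      NutchConf conf = NutchConf.get();
      System.out.println("default: " + conf.get("fetcher.threads.fetch"));

      // Setting a property on the JobConf puts it in the job file that
      // travels with the job.  On each tasktracker the local
      // nutch-site.xml is still applied last, so a value pinned there
      // would win over this one.
      JobConf job = new JobConf(conf);
      job.set("fetcher.threads.fetch", "50");
      System.out.println("per-job: " + job.get("fetcher.threads.fetch"));
    }
  }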
Another use case for eliminating the static uses of NutchConf is to simplify the construction of a configuration GUI. It would be nice to have a web-based interface that permits one to configure parameters and then run the system. Such an interface should be able to run multiple Nutch instances in a single JVM. For example, a single Nutch-based "search appliance" daemon should be able to crawl and search both your intranet and your public websites, each configured separately.
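A rough sketch of what that would allow once the singleton is gone (a NutchBean constructor taking a NutchConf doesn't exist today, that's the proposed change; searcher.dir is the existing property naming the index location):

  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.util.NutchConf;

  public class SearchAppliance {
    public static void main(String[] args) throws Exception {
      // Two independently configured instances in one JVM, which is
      // not possible while components reach for NutchConf.get().
      NutchConf intranetConf = new NutchConf();
      intranetConf.set("searcher.dir", "/data/nutch/intranet");

      NutchConf publicConf = new NutchConf();
      publicConf.set("searcher.dir", "/data/nutch/public");

      // Hypothetical constructors: today NutchBean configures itself
      // from the singleton instead of an explicit config.
      NutchBean intranet = new NutchBean(intranetConf);
      NutchBean pub = new NutchBean(publicConf);

      // ... hand each bean to its own web app / crawl scheduler ...
    }
  }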
Doug
