Separate the build and runtime environments
-------------------------------------------
Key: NUTCH-843
URL: https://issues.apache.org/jira/browse/NUTCH-843
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Currently there is no clean separation of source, build and runtime artifacts.
On one hand, it makes it easier to get started in local mode, but on the other
hand it makes the distributed (or pseudo-distributed) setup much more
challenging and tricky. Also, some resources (config files and classes) are
included several times on the classpath and loaded under different
classloaders, so in the end it's not obvious which copy takes precedence, or
why.
Here's an example of a harmful unintended behavior caused by this mess: Hadoop
daemons (jobtracker and tasktracker) will get conf/ and build/ on their
classpath. This means that a task running on this cluster will have two copies
of resources from these locations - one from the inherited classpath from
tasktracker, and the other one from the just unpacked nutch.job file. If these
two versions differ, only the first one will be loaded, which in this case is
the one taken from the (unpacked) conf/ and build/ - the other one, from within
the nutch.job file, will be ignored.
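The "first copy wins" behavior described above can be reproduced in isolation. The following is a minimal sketch, not Nutch or Hadoop code: the class name ClasspathPrecedenceDemo, its methods, and the marker strings are hypothetical, and two temporary directories stand in for the inherited tasktracker classpath and the unpacked nutch.job. It only assumes the standard URLClassLoader behavior of searching its URLs in the order given.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of Nutch or Hadoop.
final class ClasspathPrecedenceDemo {

    // Creates a temp directory holding one copy of nutch-default.xml
    // whose whole content is the given marker text.
    static Path dirWithConfig(String marker) throws IOException {
        Path dir = Files.createTempDirectory("cp-demo");
        Files.write(dir.resolve("nutch-default.xml"),
                marker.getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    // Returns the content of the copy that a classloader resolves when it
    // searches 'first' before 'second'.
    static String winningCopy(Path first, Path second) {
        try {
            URL[] searchOrder = { first.toUri().toURL(), second.toUri().toURL() };
            // Parent is null so that only the two demo directories
            // (plus the JDK's own classes) are searched.
            try (URLClassLoader loader = new URLClassLoader(searchOrder, null);
                 InputStream in = loader.getResourceAsStream("nutch-default.xml")) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates the situation from the issue: the inherited classpath entry
    // comes before the copy unpacked from the job file.
    static String demo() {
        try {
            Path inherited = dirWithConfig("copy inherited from tasktracker classpath");
            Path unpackedJob = dirWithConfig("copy unpacked from nutch.job");
            return winningCopy(inherited, unpackedJob);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Swapping the two arguments to winningCopy flips the result, which is the point: the outcome depends only on classpath order, not on which copy is newer or intended.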
It's even worse when you add more nodes to the cluster - the nutch.job will be
shipped to the new nodes as a part of each task setup, but now the remote
tasktracker child processes will use resources from nutch.job - so some tasks
will use different versions of resources than other tasks. This usually leads
to a host of issues that are very difficult to debug.
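When chasing this kind of problem, it can help to ask a classloader for every copy of a resource it can see, rather than only the one it would load. The sketch below is one possible diagnostic, using only the standard ClassLoader.getResources call; the class name ResourceCopies and its methods are hypothetical, and the default resource name nutch-default.xml is taken from this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch or Hadoop.
final class ResourceCopies {

    // Lists every location from which the loader can see the named resource,
    // in the order the loader would consult them.
    static List<URL> locate(ClassLoader loader, String name) {
        try {
            return Collections.list(loader.getResources(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true when at least one copy differs byte-for-byte from the first.
    static boolean copiesDiffer(List<URL> copies) {
        try {
            byte[] reference = null;
            for (URL url : copies) {
                try (InputStream in = url.openStream()) {
                    byte[] bytes = in.readAllBytes();
                    if (reference == null) {
                        reference = bytes;
                    } else if (!Arrays.equals(reference, bytes)) {
                        return true;
                    }
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "nutch-default.xml";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> copies = locate(loader, name);
        copies.forEach(url -> System.out.println(url));
        System.out.println("copies differ: " + copiesDiffer(copies));
    }
}
```

Running something like this on two different nodes and comparing the output would show whether they resolve the resource from the same place; more than one URL with differing content is the situation described above.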
This issue therefore proposes separating these environments into the following
areas:
* source area - i.e. our current sources. Note that the bin/ scripts will
belong to this category too, so there will be no top-level bin/. The
nutch-default.xml file belongs here as well. Other customizable files can be
moved to src/conf too, or they could stay in top-level conf/ as today, with a
README that explains that changes made there take effect only after you
rebuild the job jar.
* build area - contains build artifacts, among them the nutch.job jar.
* runtime (or deploy) area - this area contains all artifacts needed to run
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
(installed from plain vanilla Hadoop release) this will be a {{/deploy}}
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are
already included in the job jar. These resources can be copied directly to the
master Hadoop node.
For a local setup (using LocalJobTracker) this will be a {{/runtime}}
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader, the local runtime requires that
the plugins/ directory be unpacked from the job jar. We also need the Hadoop
libs to run in local mode. We may later on refine this local setup to
something like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which
actually would not be used in this case).
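The plugins/ unpacking step required by the local runtime could be sketched roughly as follows. This is an illustration only, not the proposed build change: the class PluginUnpacker is hypothetical, and it assumes that plugin files are stored under a plugins/ prefix inside the job jar, which the build may or may not lay out exactly this way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper; not part of Nutch.
final class PluginUnpacker {

    // Copies every entry under "plugins/" from the job jar into runtimeDir
    // and returns the names of the extracted files.
    static List<String> unpackPlugins(Path jobJar, Path runtimeDir) {
        List<String> extracted = new ArrayList<>();
        Path root = runtimeDir.toAbsolutePath().normalize();
        try (JarFile jar = new JarFile(jobJar.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Assumption: plugins live under this prefix in the job jar.
                if (!entry.getName().startsWith("plugins/")) {
                    continue;
                }
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "plugins/../../x" that would land
                // outside the runtime directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                try (InputStream in = jar.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                extracted.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extracted;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: PluginUnpacker <job-jar> <runtime-dir>");
            return;
        }
        unpackPlugins(Paths.get(args[0]), Paths.get(args[1]))
                .forEach(name -> System.out.println(name));
    }
}
```

Everything outside plugins/ is deliberately left inside the jar, so the unpacked runtime directory does not become yet another source of duplicate config files on the classpath.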