[
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887317#action_12887317
]
Julien Nioche commented on NUTCH-843:
-------------------------------------
revision 963217: removed the extract-hadoop task from the Ant build to avoid
creating Hadoop scripts in the bin directory.
@pham: your comment is not relevant to this issue. Please create a separate
issue, thanks.
> Separate the build and runtime environments
> -------------------------------------------
>
> Key: NUTCH-843
> URL: https://issues.apache.org/jira/browse/NUTCH-843
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 2.0
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Fix For: 2.0
>
> Attachments: NUTCH-843.patch, NUTCH-843.patch
>
>
> Currently there is no clean separation of source, build and runtime
> artifacts. On one hand this makes it easier to get started in local mode, but
> on the other hand it makes a distributed (or pseudo-distributed) setup much
> more challenging and tricky. Also, some resources (config files and classes)
> are included several times on the classpath and loaded by different
> classloaders, and in the end it's not obvious which copy takes precedence, or
> why.
> Here's an example of harmful unintended behavior caused by this mess:
> Hadoop daemons (the jobtracker and tasktracker) will get conf/ and build/ on
> their classpath. This means that a task running on this cluster will have two
> copies of the resources from these locations - one from the classpath
> inherited from the tasktracker, and the other from the just-unpacked
> nutch.job file. If these two versions differ, only the first one will be
> loaded, which in this case is the one taken from the (unpacked) conf/ and
> build/ - the other one, from within the nutch.job file, will be ignored.
> It's even worse when you add more nodes to the cluster - the nutch.job will
> be shipped to the new nodes as part of each task setup, but there the remote
> tasktracker child processes will use the resources from nutch.job - so some
> tasks will use different versions of resources than others. This usually
> leads to a host of issues that are very difficult to debug.
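The shadowing described above can be reproduced in isolation. The following is a minimal sketch (the class name, resource name and directory names are illustrative, not part of Nutch): two directories each hold a copy of the same resource, standing in for the daemon's conf/ and the unpacked nutch.job, and the classloader silently returns only the copy that appears first on the classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClasspathShadowingDemo {
    // Returns the content of the first "nutch-site.xml" visible on a classpath
    // that lists the daemon's conf/ before the job's unpacked resources.
    static String firstCopy() throws IOException {
        Path daemonConf = Files.createTempDirectory("daemon-conf");
        Path jobConf = Files.createTempDirectory("job-conf");
        Files.writeString(daemonConf.resolve("nutch-site.xml"), "daemon copy");
        Files.writeString(jobConf.resolve("nutch-site.xml"), "job copy");

        try (URLClassLoader loader = new URLClassLoader(new URL[] {
                daemonConf.toUri().toURL(), // inherited from the tasktracker
                jobConf.toUri().toURL()     // from the unpacked nutch.job
        }, null)) {
            URL found = loader.getResource("nutch-site.xml");
            return new String(found.openStream().readAllBytes());
        }
    }

    public static void main(String[] args) throws IOException {
        // The job's copy is silently shadowed by the daemon's copy.
        System.out.println(firstCopy()); // prints "daemon copy"
    }
}
```

Nothing fails and nothing is logged; the second copy is simply never consulted, which is why the resulting misbehavior is so hard to trace.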
> This issue therefore proposes separating these environments into the
> following areas:
> * source area - i.e. our current sources. Note that bin/ scripts belong to
> this category, so there will be no top-level bin/. nutch-default.xml belongs
> here as well. Other customizable files can be moved to src/conf, or they can
> stay in the top-level conf/ as today, with a README explaining that changes
> made there take effect only after you rebuild the job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run
> Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
> (installed from a plain vanilla Hadoop release) this will be a {{/deploy}}
> directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are
> already included in the job jar. These resources can be copied directly to
> the master Hadoop node.
> For a local setup (using LocalJobTracker) this will be a {{/runtime}}
> directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader, the local runtime requires that
> the plugins/ directory be unpacked from the job jar, and we need the Hadoop
> libs to run in local mode. We may later refine this local setup to something
> like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar
> (which actually would not be used in this case).
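The "unpack plugins/ from the job jar" step for the local runtime could be sketched as below (an assumption about how the build might do it, not code from the attached patch; the job jar is an ordinary zip, so java.util.zip suffices):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnpackPlugins {
    // Extracts only the plugins/ entries from the job jar into runtimeDir,
    // returning the number of files written.
    static int unpack(Path jobJar, Path runtimeDir) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(jobJar))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (!e.getName().startsWith("plugins/")) continue; // skip conf/, lib/, classes
                Path out = runtimeDir.resolve(e.getName());
                if (e.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    count++;
                }
            }
        }
        return count;
    }
}
```

Everything else (conf/, lib/, the job classes) stays inside the job jar, so only the plugins tree needs to exist on disk for the PluginClassLoader.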
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.