[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-843:
------------------------------------

    Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.

> Separate the build and runtime environments
> -------------------------------------------
>
>                 Key: NUTCH-843
>                 URL: https://issues.apache.org/jira/browse/NUTCH-843
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 2.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: NUTCH-843.patch, NUTCH-843.patch
>
>
> Currently there is no clean separation of source, build and runtime
> artifacts. On the one hand, this makes it easier to get started in local
> mode; on the other hand, it makes a distributed (or pseudo-distributed) setup
> much more challenging and error-prone. Also, some resources (config files and
> classes) are included several times on the classpath and loaded under
> different classloaders, so in the end it's not obvious which copy takes
> precedence, or why.
> Here's an example of a harmful unintended behavior caused by this mess: 
> Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
> their classpath. This means that a task running on this cluster will have two 
> copies of resources from these locations - one from the inherited classpath 
> from tasktracker, and the other one from the just unpacked nutch.job file. If 
> these two versions differ, only the first one will be loaded, which in this 
> case is the one taken from the (unpacked) conf/ and build/ - the other one, 
> from within the nutch.job file, will be ignored.
> It's even worse when you add more nodes to the cluster - the nutch.job will 
> be shipped to the new nodes as a part of each task setup, but now the remote 
> tasktracker child processes will use resources from nutch.job - so some tasks 
> will use different versions of resources than other tasks. This usually leads
> to a host of issues that are very difficult to debug.
> This issue therefore proposes separating these environments into the
> following areas:
> * source area - i.e. our current sources. Note that bin/ scripts will belong 
> to this category too, so there will be no top-level bin/. nutch-default.xml 
> belongs to this category too. Other customizable files can be moved to 
> src/conf too, or they could stay in top-level conf/ as today, with a README 
> that explains that changes made there take effect only after you rebuild the 
> job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run
> Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
> (installed from a plain vanilla Hadoop release) this will be a {{/deploy}}
> directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are 
> already included in the job jar. These resources can be copied directly to 
> the master Hadoop node.
> For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
> directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader, the local runtime requires that
> the plugins/ directory be unpacked from the job jar. We also need the Hadoop
> libs to run in local mode. We may later on refine this local setup to
> something like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar 
> (which actually would not be used in this case).
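
The classpath clash described in the issue (unpacked conf/ and build/ on the
tasktracker classpath versus the copies inside nutch.job) can be checked for
mechanically. The following is only a rough sketch, not part of the patch: the
function name conflicting_resources is hypothetical, and it assumes that a
resource's path relative to a classpath directory matches its entry name inside
the job jar, which may not hold for the real nutch.job layout.

```python
import hashlib
import zipfile
from pathlib import Path


def digest(data: bytes) -> str:
    """Content hash used to decide whether two copies of a resource differ."""
    return hashlib.sha256(data).hexdigest()


def conflicting_resources(classpath_dirs, job_jar):
    """Return the sorted names of resources that exist both in one of the
    classpath directories and in the job jar, but with different contents.

    Assumption (hypothetical layout): names are compared as paths relative
    to each classpath directory against entry names in the jar.
    """
    # Hash every regular entry in the job jar once.
    with zipfile.ZipFile(job_jar) as jar:
        jar_digests = {
            info.filename: digest(jar.read(info))
            for info in jar.infolist()
            if not info.is_dir()
        }

    conflicts = set()
    for root in map(Path, classpath_dirs):
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            name = path.relative_to(root).as_posix()
            # Same name, different bytes: which copy wins depends on
            # classpath order, which is the problem described above.
            if name in jar_digests and jar_digests[name] != digest(path.read_bytes()):
                conflicts.add(name)
    return sorted(conflicts)
```

Calling conflicting_resources(["conf", "build/classes"], "build/nutch.job")
(paths given here purely as an illustration) would list the files whose
unpacked copy and packed copy have drifted apart.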

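For the local runtime, the description says plugins/ has to be unpacked from
the job jar because of the PluginClassLoader limitation. A minimal sketch of
that one step follows. It assumes the job jar stores plugins under a top-level
plugins/ prefix; the function name unpack_plugins and its parameters are
hypothetical and are not taken from the patch or the build scripts.

```python
import zipfile
from pathlib import Path


def unpack_plugins(job_jar, runtime_dir, prefix="plugins/"):
    """Extract every jar entry under `prefix` into `runtime_dir`.

    Returns the sorted list of extracted file names (directories excluded).
    The "plugins/" prefix is an assumption about the job jar layout.
    """
    runtime = Path(runtime_dir)
    runtime.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(job_jar) as jar:
        for info in jar.infolist():
            if info.filename.startswith(prefix):
                # ZipFile.extract recreates the entry's path below runtime.
                jar.extract(info, runtime)
                if not info.is_dir():
                    extracted.append(info.filename)
    return sorted(extracted)
```

Everything outside the prefix (for example the packed config files) is left in
the jar, so the resulting directory only gains a plugins/ tree next to
bin/nutch and the Hadoop libs.
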
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
