[
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885978#action_12885978
]
Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------
OK, so I read this more. I think it would be great if we didn't have to
maintain two different deployment structures for local vs. remote. Some
comments on your local proposal:
{code}
bin/nutch - the main nutch script
conf/ - all relevant Nutch conf files
lib/hadoop-libs - static Hadoop lib files - are these jar files?
lib/nutch-libs - what are these? jar files?
plugins/ - are these the plugin directories, or plugin jar files?
nutch.jar - why wouldn't this go into the lib directory?
{code}
I could envision having one simple deployment structure that looks like this:
{code}
./bin/ - nutch script goes in here
./etc/ - all Nutch configuration property files, like nutch-default.xml, nutch-site.xml
./lib/ - all shared Nutch jar files (including nutch.jar and hadoop.jar, as well as deps); it would also be great to generate per-plugin jars that we could include in this lib directory
./logs/ - where all log files are written
./run/ - where PID files (if generated) are written
{code}
Thoughts?
> Separate the build and runtime environments
> -------------------------------------------
>
> Key: NUTCH-843
> URL: https://issues.apache.org/jira/browse/NUTCH-843
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 2.0
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
>
> Currently there is no clean separation of source, build and runtime
> artifacts. On one hand, it makes it easier to get started in local mode, but
> on the other hand it makes the distributed (or pseudo-distributed) setup much
> more challenging and tricky. Also, some resources (config files and classes)
> are included several times on the classpath and loaded under different
> classloaders, so in the end it's not obvious which copy takes precedence, or
> why.
> Here's an example of a harmful unintended behavior caused by this mess:
> Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on
> their classpath. This means that a task running on this cluster will have two
> copies of resources from these locations - one from the inherited classpath
> from tasktracker, and the other one from the just unpacked nutch.job file. If
> these two versions differ, only the first one will be loaded, which in this
> case is the one taken from the (unpacked) conf/ and build/ - the other one,
> from within the nutch.job file, will be ignored.
> It's even worse when you add more nodes to the cluster - the nutch.job will
> be shipped to the new nodes as a part of each task setup, but now the remote
> tasktracker child processes will use resources from nutch.job - so some tasks
> will use different versions of resources than other tasks. This usually leads
> to a host of very difficult-to-debug issues.
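> The "first copy wins" behavior described above can be reproduced with a small
> stand-alone sketch (hypothetical resource contents; this is plain JDK
> classloading, not Nutch code): two directories each carrying a nutch-site.xml
> go on one classpath, and lookups silently return the copy from whichever
> entry comes first.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ClasspathPrecedenceDemo {
    public static void main(String[] args) throws IOException {
        // Two directories, each holding a resource with the same name -
        // stand-ins for the daemon's unpacked conf/ and the job jar's conf.
        Path daemonConf = Files.createTempDirectory("daemon-conf");
        Path jobConf = Files.createTempDirectory("job-conf");
        Files.writeString(daemonConf.resolve("nutch-site.xml"), "<conf>daemon</conf>");
        Files.writeString(jobConf.resolve("nutch-site.xml"), "<conf>job</conf>");

        // Classpath order mirrors the scenario: the daemon's conf/ comes first.
        URLClassLoader cl = new URLClassLoader(new URL[] {
            daemonConf.toUri().toURL(), jobConf.toUri().toURL()
        }, null);

        // Both copies are visible on the classpath...
        List<URL> all = Collections.list(cl.getResources("nutch-site.xml"));
        System.out.println("copies on classpath: " + all.size());

        // ...but a plain lookup silently returns only the first one.
        System.out.println(
            new String(cl.getResourceAsStream("nutch-site.xml").readAllBytes()));
        cl.close();
    }
}
```

> Swapping the order of the two URLs flips which copy wins - which is exactly
> why the unpacked conf/ on the tasktracker classpath shadows the copy inside
> nutch.job.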
> This issue proposes to separate these environments into the following
> areas:
> * source area - i.e. our current sources. Note that bin/ scripts belong to
> this category, so there will be no top-level bin/. nutch-default.xml belongs
> here as well. Other customizable files can either be moved to src/conf, or
> stay in top-level conf/ as today, with a README explaining that changes made
> there take effect only after you rebuild the job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run
> Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
> (installed from plain vanilla Hadoop release) this will be a {{/deploy}}
> directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are
> already included in the job jar. These resources can be copied directly to
> the master Hadoop node.
> For a local setup (using Hadoop's LocalJobRunner) this will be a {{/runtime}}
> directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader, the local runtime requires that
> the plugins/ directory be unpacked from the job jar, and we need the Hadoop
> libs to run in local mode. We may later refine this local setup to
> something like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar
> (which actually would not be used in this case).
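> The unpacking requirement follows from how directory-based plugin loading
> generally works: a URLClassLoader needs real filesystem URLs (directories or
> jars), so a plugins/ tree nested inside the job jar is not directly loadable.
> A generic sketch (hypothetical plugin names; this is not Nutch's actual
> PluginClassLoader) that builds one isolated loader per plugin directory:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PluginLoaderSketch {
    // Build one classloader per plugin directory found under plugins/.
    // Each loader sees the plugin's own directory (descriptor, resources)
    // plus any jar files shipped inside it.
    static Map<String, URLClassLoader> loadPlugins(Path pluginsDir) throws IOException {
        Map<String, URLClassLoader> loaders = new HashMap<>();
        try (Stream<Path> dirs = Files.list(pluginsDir).filter(Files::isDirectory)) {
            for (Path dir : dirs.collect(Collectors.toList())) {
                List<URL> urls = new ArrayList<>();
                urls.add(dir.toUri().toURL());
                try (Stream<Path> jars =
                         Files.list(dir).filter(p -> p.toString().endsWith(".jar"))) {
                    for (Path jar : jars.collect(Collectors.toList())) {
                        urls.add(jar.toUri().toURL());
                    }
                }
                loaders.put(dir.getFileName().toString(),
                            new URLClassLoader(urls.toArray(new URL[0])));
            }
        }
        return loaders;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical plugin directories, as they would look once unpacked.
        Path plugins = Files.createTempDirectory("plugins");
        Files.createDirectories(plugins.resolve("parse-html"));
        Files.createDirectories(plugins.resolve("index-basic"));
        Map<String, URLClassLoader> loaders = loadPlugins(plugins);
        System.out.println("plugins loaded: " + new TreeSet<>(loaders.keySet()));
    }
}
```

> Nothing in this scheme can traverse a directory tree that only exists as
> entries inside a jar, which is why the local runtime has to unpack plugins/
> onto the filesystem first.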