[
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885978#action_12885978
]
Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------
OK, so I read this more. I think it would be great if we didn't have to
maintain two different deployment structures for local vs. remote. Some
comments on your local proposal:
{code}
bin/nutch - the main nutch script
conf/ - all relevant Nutch conf files
lib/hadoop-libs - static Hadoop lib files - are these jar files?
lib/nutch-libs - what are these? jar files?
plugins/ - are these the plugin directories, or plugin jar files?
nutch.jar - why wouldn't this go into the lib directory?
{code}
I could envision having one simple deployment structure that looks like this:
{code}
./bin/ - nutch script goes in here
./etc/ - all Nutch configuration property files, like nutch-default.xml, nutch-site.xml
./lib/ - all shared Nutch jar files (including nutch.jar and hadoop.jar, as well as deps); it would also be great to generate per-plugin jars that we could include in this lib directory
./logs/ - where all log files are written
./run/ - where PID files (if generated) are written
{code}
Thoughts?
> Separate the build and runtime environments
> -------------------------------------------
>
> Key: NUTCH-843
> URL: https://issues.apache.org/jira/browse/NUTCH-843
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 2.0
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
>
> Currently there is no clean separation of source, build and runtime
> artifacts. On one hand, it makes it easier to get started in local mode, but
> on the other hand it makes the distributed (or pseudo-distributed) setup much
> more challenging and tricky. Also, some resources (config files and classes)
> are included several times on the classpath and loaded under different
> classloaders, so in the end it's not obvious which copy takes precedence, or
> why.
> Here's an example of a harmful unintended behavior caused by this mess:
> Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on
> their classpath. This means that a task running on this cluster will have two
> copies of resources from these locations - one from the inherited classpath
> from tasktracker, and the other one from the just unpacked nutch.job file. If
> these two versions differ, only the first one will be loaded, which in this
> case is the one taken from the (unpacked) conf/ and build/ - the other one,
> from within the nutch.job file, will be ignored.
> It's even worse when you add more nodes to the cluster - the nutch.job will
> be shipped to the new nodes as a part of each task setup, but now the remote
> tasktracker child processes will use resources from nutch.job - so some tasks
> will use different versions of resources than other tasks. This usually leads
> to a host of very difficult-to-debug issues.
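> The "first copy wins" behavior described above can be reproduced with a small
> stand-alone sketch (hypothetical resource contents; this is plain JDK
> classloading, not Nutch code): two directories each carrying a nutch-site.xml
> go on one classpath, and lookups silently return the copy from whichever
> entry comes first.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ClasspathPrecedenceDemo {
    public static void main(String[] args) throws IOException {
        // Two directories, each holding a resource with the same name -
        // stand-ins for the daemon's unpacked conf/ and the job jar's conf.
        Path daemonConf = Files.createTempDirectory("daemon-conf");
        Path jobConf = Files.createTempDirectory("job-conf");
        Files.writeString(daemonConf.resolve("nutch-site.xml"), "<conf>daemon</conf>");
        Files.writeString(jobConf.resolve("nutch-site.xml"), "<conf>job</conf>");

        // Classpath order mirrors the scenario: the daemon's conf/ comes first.
        URLClassLoader cl = new URLClassLoader(new URL[] {
            daemonConf.toUri().toURL(), jobConf.toUri().toURL()
        }, null);

        // Both copies are visible on the classpath...
        List<URL> all = Collections.list(cl.getResources("nutch-site.xml"));
        System.out.println("copies on classpath: " + all.size());

        // ...but a plain lookup silently returns only the first one.
        System.out.println(
            new String(cl.getResourceAsStream("nutch-site.xml").readAllBytes()));
        cl.close();
    }
}
```

> Swapping the order of the two URLs flips which copy wins - which is exactly
> why the unpacked conf/ on the tasktracker classpath shadows the copy inside
> nutch.job.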
> This issue proposes to separate these environments into the following
> areas:
> * source area - i.e. our current sources. Note that bin/ scripts belong to
> this category, so there will be no top-level bin/. nutch-default.xml belongs
> here as well. Other customizable files can either be moved to src/conf, or
> stay in top-level conf/ as today, with a README explaining that changes made
> there take effect only after you rebuild the job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run
> Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
> (installed from plain vanilla Hadoop release) this will be a {{/deploy}}
> directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are
> already included in the job jar. These resources can be copied directly to
> the master Hadoop node.
> For a local setup (using Hadoop's LocalJobRunner) this will be a {{/runtime}}
> directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader, the local runtime requires that
> the plugins/ directory be unpacked from the job jar, and we need the Hadoop
> libs to run in local mode. We may later refine this local setup to
> something like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar
> (which actually would not be used in this case).
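> The unpacking requirement follows from how directory-based plugin loading
> generally works: a URLClassLoader needs real filesystem URLs (directories or
> jars), so a plugins/ tree nested inside the job jar is not directly loadable.
> A generic sketch (hypothetical plugin names; this is not Nutch's actual
> PluginClassLoader) that builds one isolated loader per plugin directory:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PluginLoaderSketch {
    // Build one classloader per plugin directory found under plugins/.
    // Each loader sees the plugin's own directory (descriptor, resources)
    // plus any jar files shipped inside it.
    static Map<String, URLClassLoader> loadPlugins(Path pluginsDir) throws IOException {
        Map<String, URLClassLoader> loaders = new HashMap<>();
        try (Stream<Path> dirs = Files.list(pluginsDir).filter(Files::isDirectory)) {
            for (Path dir : dirs.collect(Collectors.toList())) {
                List<URL> urls = new ArrayList<>();
                urls.add(dir.toUri().toURL());
                try (Stream<Path> jars =
                         Files.list(dir).filter(p -> p.toString().endsWith(".jar"))) {
                    for (Path jar : jars.collect(Collectors.toList())) {
                        urls.add(jar.toUri().toURL());
                    }
                }
                loaders.put(dir.getFileName().toString(),
                            new URLClassLoader(urls.toArray(new URL[0])));
            }
        }
        return loaders;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical plugin directories, as they would look once unpacked.
        Path plugins = Files.createTempDirectory("plugins");
        Files.createDirectories(plugins.resolve("parse-html"));
        Files.createDirectories(plugins.resolve("index-basic"));
        Map<String, URLClassLoader> loaders = loadPlugins(plugins);
        System.out.println("plugins loaded: " + new TreeSet<>(loaders.keySet()));
    }
}
```

> Nothing in this scheme can traverse a directory tree that only exists as
> entries inside a jar, which is why the local runtime has to unpack plugins/
> onto the filesystem first.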