Hi Edoardo,

> take org.apache.nutch.crawl.Generator for example...
> I have the feeling this is a bit redundant, and we could just define
> nutch.crawldb.path and nutch.segments.path options, and pass them - as
> well as all the other - to the hadoop runner as -D options (granted this
> could result in a pretty long command, but not for the script user who
> should still rely on the current usage syntax.)
Fully agreed for all command-line arguments which point to properties, e.g.
"-topN", even temporary properties used only to pass command-line values to
map and reduce jobs, e.g. "generate.topN", see
https://wiki.apache.org/nutch/NutchPropertiesCompleteList.

Regarding input and output paths: yes, they are more or less constants; the
path to the crawldb will hardly change for one and the same crawl. So they
could be read from properties, although input/output paths are never passed
to jobs via properties (they need addInputPath(), setOutputPath(), etc.).

> The bash script can be tweaked to preserve the same command line syntax,
> and just generate a list of -D option overrides for the final hadoop
> invocation.

If the translation from backward-compatible command-line args to -D
properties is delegated to bin/nutch, this would also mean that the
knowledge about the arguments of each tool has to be added to bin/nutch.
Currently, it's contained in each class in the run() method. It should stay
there, but we could relax the tool classes and make the arguments optional.
If a value is passed via the command line, it overrides the property.

(from Chris Mattmann):
> so it would be nice to make the back compat and like you said non
> disruptive.

Right. That's the point: nobody wants to rewrite customized scripts or
re-learn how to use the Nutch tools. Thanks, Chris!

Thanks (and hope to see a patch),
Sebastian

On 09/24/2014 04:16 PM, Edoardo Causarano wrote:
> Hi Sebastian,
>
> take org.apache.nutch.crawl.Generator for example. The way the class is
> written, the main method will pass the command line options to the actual
> run method, and if the crawldb and segments paths are missing it will
> bail out. Other command line parameters are parsed in the run method to
> possibly override other site-defined options.
>
> I have the feeling this is a bit redundant, and we could just define
> nutch.crawldb.path and nutch.segments.path options, and pass them - as
> well as all the other - to the hadoop runner as -D options (granted this
> could result in a pretty long command, but not for the script user who
> should still rely on the current usage syntax.)
>
> The bash script can be tweaked to preserve the same command line syntax,
> and just generate a list of -D option overrides for the final hadoop
> invocation.
>
> This way one can define all necessary paths in the site.xml file or as
> Hadoop options in HUE/Oozie (as opposed to parameterized command line
> arguments in every workflow step.)
>
> In the end the main methods (or run()) wouldn't do any command line
> parsing and would just rely on the set of -D options defined externally
> (either by the bash script, oozie workflow, etc...)
>
> What do you think?
>
> Best,
> Edoardo
>
> On Wed, Sep 24, 2014 at 3:34 PM, Sebastian Nagel <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Hi Edoardo,
>
>     > To make things easy I've used the JavaMain action to execute the
>     > classes that the nutch script invokes, parameterized as necessary.
>     Ok. That means that each step (inject, generate, fetch, etc.) runs
>     in its own JVM. Right?
>
>     > One thing that I noticed is that I found configuring the command
>     > line arguments a tad cumbersome, so: would it be unthinkable to
>     > adopt the Hadoop -D configuration setting convention to set these
>     > options?
>
>     All tool classes (Injector, Generator, Fetcher, ParseSegment,
>     CrawlDb, LinkDb, IndexingJob) implement Tool and should process
>     -D hadoop.property=value options. If some classes do not, please
>     report it in Jira, provide a patch and/or we'll fix it.
>
>     > bash scripts could still hide the extra verbosity and preserve the
>     > current args, while adding the option to define them in
>     > nutch-site.xml or in Oozie under a more practical element.
>
>     Can you explain what "extra verbosity" means?
>
>     Thanks,
>     Sebastian
>
>
>     2014-09-24 11:39 GMT+02:00 Edoardo Causarano <[email protected]
>     <mailto:[email protected]>>:
>
>         Hi all,
>
>         I've been busy lately with a Nutch 1.x setup and I've managed to
>         replicate the crawl script into an Oozie workflow (and HUE for a
>         pretty web UI). To make things easy I've used the JavaMain
>         action to execute the classes that the nutch script invokes,
>         parameterized as necessary.
>
>         One thing that I noticed is that I found configuring the command
>         line arguments a tad cumbersome, so: would it be unthinkable to
>         adopt the Hadoop -D configuration setting convention to set
>         these options?
>
>         bash scripts could still hide the extra verbosity and preserve
>         the current args, while adding the option to define them in
>         nutch-site.xml or in Oozie under a more practical element.
>
>         The patch wouldn't be too disruptive, but I don't want to do
>         work that wouldn't be folded into upstream, so let me know if
>         such an approach flies in the face of community-wide decisions
>         and so on...
>
>         Best,
>         Edoardo
>
>         --
>         A Motto
>         Smile a while, and while you smile another smiles
>         And soon there's miles and miles of smiles
>         And life's worth while because you smile
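The override behaviour Sebastian describes above — positional arguments stay supported for backward compatibility, but fall back to configuration properties when absent, with the command-line value winning when both are given — can be sketched roughly as follows. This is an illustrative sketch only: a plain Map stands in for Hadoop's Configuration, and the property names (nutch.crawldb.path, nutch.segments.path) are the ones proposed in this thread, not existing Nutch properties.

```java
import java.util.HashMap;
import java.util.Map;

public class ArgSketch {

    // Resolve one tool argument: an explicit command-line value wins;
    // otherwise fall back to the configuration property (which may come
    // from nutch-site.xml or a -D option on the hadoop invocation).
    static String resolve(String cliValue, Map<String, String> conf, String key) {
        return cliValue != null ? cliValue : conf.get(key);
    }

    public static void main(String[] args) {
        // Values that would normally arrive via nutch-site.xml or -D:
        Map<String, String> conf = new HashMap<>();
        conf.put("nutch.crawldb.path", "crawl/crawldb");
        conf.put("nutch.segments.path", "crawl/segments");

        // No positional argument given: the property is used.
        System.out.println(resolve(null, conf, "nutch.crawldb.path"));
        // Positional argument given: it overrides the property.
        System.out.println(resolve("other/crawldb", conf, "nutch.crawldb.path"));
    }
}
```

In a real Tool implementation the Configuration returned by getConf() would already contain any -D overrides, since ToolRunner's GenericOptionsParser strips -D options from args and applies them before run() is called; run() would then only need the fallback lookup shown here when a positional argument is missing.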

