Hi Edoardo,

> take org.apache.nutch.crawl.Generator for example...
> I have the feeling this is a bit redundant, and we could just define 
> nutch.crawldb.path and
> nutch.segments.path options, and pass them - as well as all the other - to 
> the hadoop runner as -D
> options (granted this could result in a pretty long command, but not for the 
> script user who
> should still rely on the current usage syntax.)

Fully agreed for all command-line arguments that map to
properties, e.g. "-topN", including temporary properties used
only to pass command-line values to map and reduce jobs,
e.g. "generate.topN"; see
https://wiki.apache.org/nutch/NutchPropertiesCompleteList.
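
As a minimal sketch (not Nutch code; it assumes only the standard Hadoop
Configuration API), this is how a value passed as "-D generate.topN=1000"
becomes visible inside the job:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: ToolRunner/GenericOptionsParser put "-D" options into the
// job Configuration; a map or reduce task can then read the value back.
public class TopNDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // This is what a real "-D generate.topN=1000" option would do:
    conf.setLong("generate.topN", 1000L);
    // And this is how a mapper's setup() reads it back, with a default:
    long topN = conf.getLong("generate.topN", Long.MAX_VALUE);
    System.out.println("generate.topN = " + topN);
  }
}
```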

Regarding input and output paths: yes, they are more or less constants;
the path to the crawldb will hardly change within one and the same crawl.
So they could be read from properties, although input/output paths are
never passed to jobs via properties alone (they need FileInputFormat.addInputPath,
FileOutputFormat.setOutputPath, etc.)
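
A sketch of what I mean (using the new mapreduce API; the property names
"nutch.crawldb.path" and "nutch.segments.path" are your proposal, not
existing Nutch properties): even if the path strings come from properties,
the job still has to wire them in explicitly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: properties can supply the path strings, but input and
// output must still be registered on the job object itself.
public class PathWiringSketch {
  public static void configure(Job job) throws Exception {
    Configuration conf = job.getConfiguration();
    Path crawlDb = new Path(conf.get("nutch.crawldb.path", "crawl/crawldb"));
    Path segments = new Path(conf.get("nutch.segments.path", "crawl/segments"));
    FileInputFormat.addInputPath(job, crawlDb);     // input is not implicit
    FileOutputFormat.setOutputPath(job, segments);  // neither is output
  }
}
```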

> The bash script can be tweaked to preserve the same command line syntax, and 
> just generate a list
> of -D option overrides for the final hadoop invocation.

If the translation from backward-compatible command-line args to -D properties
is delegated to bin/nutch, this would also mean that the knowledge
about the arguments of each tool has to be added to bin/nutch.
Currently, it's contained in each class's run() method.
It should still be there, but we could relax the tool classes
and make the arguments optional: if an argument is passed on the
command line, it overrides the corresponding property.
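
Something like this hypothetical run() (a sketch, not the actual Generator
code; "nutch.crawldb.path" is again your proposed property name): every
flag is optional, and a flag that is given wins over the property:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

// Hypothetical "relaxed" tool: arguments are optional overrides for
// properties, so "-D generate.topN=..." and "-topN ..." both work,
// with the explicit flag winning.
public class RelaxedTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    for (int i = 0; i < args.length; i++) {
      if ("-topN".equals(args[i])) {
        conf.setLong("generate.topN", Long.parseLong(args[++i]));
      } else if ("-crawldb".equals(args[i])) {
        conf.set("nutch.crawldb.path", args[++i]);
      }
      // unknown arguments would be reported here
    }
    // ... set up and submit the job using conf ...
    return 0;
  }
}
```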

(from Chris Mattmann):
> so it would be nice to make the back compat and
> like you said non disruptive.
Right. That's the point: nobody wants to rewrite customized
scripts or re-learn how to use the Nutch tools. Thanks, Chris!

Thanks (and hope to see a patch),
Sebastian


On 09/24/2014 04:16 PM, Edoardo Causarano wrote:
> Hi Sebastian,
> 
> take org.apache.nutch.crawl.Generator for example. The way the class is 
> written, the main method
> will pass the command line options to the actual run method and if crawldb 
> and segments paths are
> missing it will bail out. Other command line parameters are parsed in the run 
> method to possibly
> override other site defined options.  
> 
> I have the feeling this is a bit redundant, and we could just define 
> nutch.crawldb.path and
> nutch.segments.path options, and pass them - as well as all the other - to 
> the hadoop runner as -D
> options (granted this could result in a pretty long command, but not for the 
> script user who should
> still rely on the current usage syntax.)
> 
> The bash script can be tweaked to preserve the same command line syntax, and 
> just generate a list of
> -D option overrides for the final hadoop invocation.
> 
> This way one can define all necessary paths in the site.xml file or as Hadoop 
> options in HUE/Oozie
> (as opposed to parameterized command line arguments in every workflow step.)
> 
> In the end the main methods (or run() ) wouldn't do any command line parsing 
> and just rely on the
> set of -D options defined externally (either by the bash script, oozie 
> workflow, etc...)
> 
> What do you think?
> 
> 
> Best,
> Edoardo   
> 
> On Wed, Sep 24, 2014 at 3:34 PM, Sebastian Nagel <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Hi Edoardo,
> 
>     > To make things easy I've used the JavaMain action to execute the classes
>     > that the nutch scripts invoke, parametrized as necessary.
>     Ok. That means that each step (inject, generate, fetch, etc.) runs in its 
> own JVM. Right?
> 
>     > One thing that I noticed is that I found configuring the command line 
> arguments
>     > a tad cumbersome so: would it be unthinkable to adopt the Hadoop -D 
> configuration
>     > setting convention to set these options?
> 
>     All tool classes (Injector, Generator, Fetcher, ParseSegment, CrawlDb, 
> LinkDb, IndexingJob)
>     implement Tool and should process -D hadoop.property=value options. If 
> some classes do not,
>     please, report it in Jira, provide a patch and/or we'll fix it.
> 
>     > bash scripts could still hide the extra verbosity and preserve the 
> current args,
>     > while adding the option to define them in nutch-site.xml or in Oozie 
> under
>     > a more practical element.
> 
>     Can you explain what "extra verbosity" means?
> 
>     Thanks,
>     Sebastian
> 
> 
> 
>     2014-09-24 11:39 GMT+02:00 Edoardo Causarano <[email protected]
>     <mailto:[email protected]>>:
> 
>         Hi all,
> 
>         I've been busy lately with a Nutch 1.x setup and I've managed to 
> replicate the crawl script
>         into an Oozie workflow (and HUE for pretty web UI). To make things 
> easy I've used the
>         JavaMain action to execute the classes that the nutch scripts 
> invoke, parametrized as
>         necessary.
> 
>         One thing that I noticed is that I found configuring the command line 
> arguments a tad
>         cumbersome so: would it be unthinkable to adopt the Hadoop -D 
> configuration setting
>         convention to set these options?
> 
>         bash scripts could still hide the extra verbosity and preserve the 
> current args, while
>         adding the option to define them in nutch-site.xml or in Oozie under 
> a more practical element.
> 
>         The patch wouldn't be too disruptive, but I don't want to do work 
> that wouldn't be folded
>         into upstream so let me know if such an approach flies in the face of 
> community wide
>         decisions and so on...
> 
> 
>         Best,
>         Edoardo          
> 
>         -- 
>         A Motto
>         Smile a while, and while you smile
>            another smiles
>         And soon there's miles and miles
>            of smiles
>         And life's worth while because
>            you smile
> 
