Hi Sebastian,

Take org.apache.nutch.crawl.Generator, for example. As the class is
written, the main method passes the command-line options to the actual
run method, and the tool bails out if the crawldb and segments paths are
missing. The other command-line parameters are parsed in the run method,
where they may override site-defined options.

I have the feeling this is a bit redundant: we could just define
nutch.crawldb.path and nutch.segments.path options and pass them, along
with all the others, to the Hadoop runner as -D options. (Granted, this
could result in a pretty long command, but not for script users, who
would still rely on the current usage syntax.)

The bash script can be tweaked to preserve the same command-line syntax
and simply generate a list of -D option overrides for the final Hadoop
invocation.
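
To make the translation step concrete, here is a rough sketch of the
mapping from the current "generate" arguments to -D overrides. It is
written in Java only for illustration (the real change would live in the
bash script), the class and method names are made up, and generate.topN
is an assumed key for -topN. Only nutch.crawldb.path and
nutch.segments.path are the names I am actually proposing.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: turn legacy "generate" arguments into -D overrides. */
public class LegacyArgs {

  /**
   * Expects: <crawldb> <segments_dir> [-topN N].
   * nutch.crawldb.path and nutch.segments.path are the proposed option
   * names; generate.topN is an assumption, not a confirmed key.
   */
  public static List<String> toDOptions(String[] args) {
    if (args.length < 2) {
      throw new IllegalArgumentException(
          "Usage: generate <crawldb> <segments_dir> [-topN N]");
    }
    List<String> out = new ArrayList<>();
    // The two positional arguments become plain property overrides.
    out.add("-D");
    out.add("nutch.crawldb.path=" + args[0]);
    out.add("-D");
    out.add("nutch.segments.path=" + args[1]);
    // Remaining flags are mapped one by one; unknown flags are rejected.
    for (int i = 2; i < args.length; i++) {
      if ("-topN".equals(args[i]) && i + 1 < args.length) {
        out.add("-D");
        out.add("generate.topN=" + args[++i]);
      } else {
        throw new IllegalArgumentException(
            "Unknown or incomplete option: " + args[i]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Prints the override list that would be handed to the Hadoop runner.
    System.out.println(String.join(" ", toDOptions(args)));
  }
}
```

The user-facing syntax stays exactly as it is today; only the final
invocation sees the longer -D list.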

This way one can define all necessary paths in the site.xml file or as
Hadoop options in HUE/Oozie (as opposed to parameterized command-line
arguments in every workflow step).

In the end, the main methods (or run()) wouldn't do any command-line
parsing and would just rely on the set of -D options defined externally
(by the bash script, an Oozie workflow, etc.).
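
As a sketch of what the tool side could then look like: the run method
only looks up the required options and fails with a clear message when
one is missing. java.util.Properties is used here purely as a stand-in
for the Hadoop Configuration (in a real Tool the lookup would be
getConf().get(key)), and the class and method names are hypothetical.

```java
import java.util.Properties;

/** Illustrative sketch: a tool that reads its paths from configuration only. */
public class ConfiguredPaths {

  /** Returns the trimmed value for key, or fails if it is unset or blank. */
  static String require(Properties conf, String key) {
    String value = conf.getProperty(key);
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalStateException(
          "Missing required option: -D " + key + "=<path>");
    }
    return value.trim();
  }

  /** Stand-in for a tool's run(): no argument parsing, only lookups. */
  static String describe(Properties conf) {
    String crawlDb = require(conf, "nutch.crawldb.path");
    String segments = require(conf, "nutch.segments.path");
    return "crawldb=" + crawlDb + " segments=" + segments;
  }

  public static void main(String[] args) {
    // Properties stands in for values coming from site.xml or -D overrides.
    Properties conf = new Properties();
    conf.setProperty("nutch.crawldb.path", "crawl/crawldb");
    conf.setProperty("nutch.segments.path", "crawl/segments");
    System.out.println(describe(conf));
  }
}
```

The "bail out if the paths are missing" behaviour is kept; it just moves
from argument counting to a configuration check.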

What do you think?


Best,
Edoardo

On Wed, Sep 24, 2014 at 3:34 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Edoardo,
>
> > To make things easy I've used the JavaMain action to execute the classes
> > that the Nutch scripts invoke, parametrized as necessary.
> Ok. That means that each step (inject, generate, fetch, etc.) runs in its
> own JVM. Right?
>
> > One thing that I noticed is that I found configuring the command line
> > arguments a tad cumbersome so: would it be unthinkable to adopt the
> > Hadoop -D configuration setting convention to set these options?
>
> All tool classes (Injector, Generator, Fetcher, ParseSegment, CrawlDb,
> LinkDb, IndexingJob) implement Tool and should process
> -D hadoop.property=value options. If some classes do not, please report
> it in Jira and provide a patch, and/or we'll fix it.
>
> > bash scripts could still hide the extra verbosity and preserve the
> > current args, while adding the option to define them in nutch-site.xml
> > or in Oozie under a more practical element.
>
> Can you explain what "extra verbosity" means?
>
> Thanks,
> Sebastian
>
>
>
> 2014-09-24 11:39 GMT+02:00 Edoardo Causarano <[email protected]>:
>
>> Hi all,
>>
>> I've been busy lately with a Nutch 1.x setup and I've managed to
>> replicate the crawl script as an Oozie workflow (with HUE for a pretty
>> web UI). To make things easy I've used the JavaMain action to execute the
>> classes that the Nutch scripts invoke, parametrized as necessary.
>>
>> One thing that I noticed is that I found configuring the command line
>> arguments a tad cumbersome so: would it be unthinkable to adopt the Hadoop
>> -D configuration.setting convention to set these options?
>>
>> bash scripts could still hide the extra verbosity and preserve the
>> current args, while adding the option to define them in nutch-site.xml or
>> in Oozie under a more practical element.
>>
>> The patch wouldn't be too disruptive, but I don't want to do work that
>> wouldn't be folded into upstream, so let me know if such an approach flies
>> in the face of community-wide decisions and so on...
>>
>>
>> Best,
>> Edoardo
>>
>> --
>> A Motto
>> Smile a while, and while you smile
>>    another smiles
>> And soon there's miles and miles
>>    of smiles
>> And life's worth while because
>>    you smile
>>
>>
>


