[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic

Michael Joyce (JIRA) Wed, 15 Apr 2015 08:56:05 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
 ]


Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM:
---------------------------------------------------------------

Hi folks,

I'll have a patch up in a bit for this. To keep the number of changes in a 
single patch small, my current plan is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echo messages in the bin/crawl script so they 
no longer mention only Solr, which could confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution, but it minimizes call 
format changes for people with existing setups and only requires adding or 
updating a single configuration value if you want to keep using Solr on an 
existing setup. Note that this change obviously requires documentation updates. 
I'm more than happy to help with those as well, but I'm not including them in 
this ticket.
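
A rough sketch of how the script's argument handling could tell the two call 
forms apart is below. It is an illustration only, not the patch and not the 
actual bin/crawl code: the function name parse_crawl_args and the variable 
names are hypothetical, and it assumes that any non-empty third argument 
(including an old Solr URL) simply turns indexing on, so existing invocations 
keep working.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; this is not the actual bin/crawl code.
# Accepts either of the two proposed call forms:
#   <seedDir> <crawlDir> <numberOfRounds>              (no indexing)
#   <seedDir> <crawlDir> <indexFlag> <numberOfRounds>  (indexing)
# Assumption: any non-empty <indexFlag> (e.g. "run_indexer" or an old
# Solr URL) enables indexing; it no longer configures the indexer.
parse_crawl_args() {
  case $# in
    3) SEED_DIR=$1; CRAWL_PATH=$2; RUN_INDEXER=false; LIMIT=$3 ;;
    4) SEED_DIR=$1; CRAWL_PATH=$2; LIMIT=$4
       if [ -n "$3" ]; then RUN_INDEXER=true; else RUN_INDEXER=false; fi ;;
    *) echo "Usage: crawl <seedDir> <crawlDir> [<indexFlag>] <numberOfRounds>" >&2
       return 1 ;;
  esac
}

# Example call: the flag only toggles the indexing step.
parse_crawl_args urls/ crawl/ "run_indexer" 1 &&
  echo "rounds=$LIMIT index=$RUN_INDEXER"
```

With this shape, the indexer itself would be configured entirely through the 
conf files (for example solr.server.url), and the script would only decide 
whether the indexing step runs.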

Thoughts?


was (Author: mjoyce):
Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling 
formats for people with existing setups and only really requires that a single 
configuration value is added/updated. Note, this change obviously requires 
some/many documentation updates. I'm more than happy to help with those as well 
but I wasn't including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin, I still 
> need to call the crawl script with a fake Solr URL; otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the Elasticsearch indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing occurs, instead of also 
> trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
