[
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
]
Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM:
---------------------------------------------------------------
Hi folks,
I'll have a patch up in a bit for this. To keep the number of changes in a
single patch small, my current plan is to:
* Add solr.server.url to nutch-default and set the value to some sane default
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change
the call format.
* Update some variable names and echo messages in the bin/crawl script so they
no longer mention only Solr, which confuses people
I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1
# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}
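A minimal sketch of how the script could tell these two call formats apart,
assuming the indexing flag is simply detected by argument count. The function
name parse_crawl_args and the variables it sets are hypothetical and are not
taken from the actual bin/crawl script or from the patch:

```shell
#!/bin/sh
# Hypothetical helper (not from bin/crawl): decide whether to index
# based on how many positional arguments were passed.
#   4 args: <seedDir> <crawlDir> <indexFlag> <numRounds>  -> index
#   3 args: <seedDir> <crawlDir> <numRounds>              -> skip indexing
parse_crawl_args() {
  if [ "$#" -eq 4 ]; then
    SEEDDIR="$1"
    CRAWL_PATH="$2"
    RUN_INDEXER=true      # the third argument only acts as a toggle
    LIMIT="$4"
  elif [ "$#" -eq 3 ]; then
    SEEDDIR="$1"
    CRAWL_PATH="$2"
    RUN_INDEXER=false
    LIMIT="$3"
  else
    echo "Usage: crawl <seedDir> <crawlDir> [<indexFlag>] <numRounds>" >&2
    return 1
  fi
}
```

With something like this, the indexer itself (Solr, Elasticsearch, ...) would
be configured entirely in the conf files, and the extra argument would only
decide whether the indexing step runs.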
I don't think this is necessarily the ideal solution, but it minimizes call
format changes for people with existing setups and only requires adding or
updating a single configuration value if you want to keep using Solr on an
existing setup. Note that this change requires documentation updates. I'm more
than happy to help with those as well, but I'm not including them in this
ticket.
Thoughts?
> Make bin/crawl indexer agnostic
> -------------------------------
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Michael Joyce
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawl script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the Elasticsearch indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing occurs, instead of also
> trying to configure the indexer at the same time.