[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

Julien Nioche (JIRA) Mon, 28 Jan 2013 02:15:16 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173
 ]


Julien Nioche commented on NUTCH-1047:
--------------------------------------

@tejasp I can reproduce the issue and am looking into it. Somehow the 
configuration does not get passed on properly when using the crawl command. 
Thanks.

Lufeng 
{quote}
But i don't know why not add an option to set IndexerUrl such as bin/nutch 
solrindex -indexurl http://localhost:8983/solr/.
{quote}

Whether it is passed as a parameter or via configuration should not make much 
of a difference. Your suggestion also assumes that the indexing backend can be 
reached via a single URL, which is not necessarily the case: a backend might 
not need a URL at all or, conversely, might need several. Better to leave that 
logic in the configuration and assume that the backends will find whatever 
they need there.
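
To illustrate the point, here is a minimal sketch in Java. Everything in it is 
hypothetical: the IndexBackend interface, the three backend classes and the 
example.* configuration keys are made up for this example and are not the 
actual Nutch extension point or its property names. It only shows why a single 
-indexurl style parameter would not fit every backend, whereas a shared 
configuration does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface, NOT the real Nutch extension point.
interface IndexBackend {
    // Each backend pulls whatever settings it needs from the shared configuration.
    void configure(Map<String, String> conf);

    String describe();
}

// A backend that is reachable via exactly one URL.
class SingleUrlBackend implements IndexBackend {
    private String url;

    public void configure(Map<String, String> conf) {
        url = conf.getOrDefault("example.single.url", ""); // hypothetical key
    }

    public String describe() {
        return "single:" + url;
    }
}

// A backend that needs several URLs, e.g. a list of cluster nodes.
class MultiUrlBackend implements IndexBackend {
    private List<String> urls;

    public void configure(Map<String, String> conf) {
        // hypothetical key holding a comma-separated list
        urls = Arrays.asList(conf.getOrDefault("example.multi.urls", "").split(","));
    }

    public String describe() {
        return "multi:" + urls.size();
    }
}

// A backend that needs no URL at all, only a local output directory.
class LocalDirBackend implements IndexBackend {
    private String dir;

    public void configure(Map<String, String> conf) {
        dir = conf.getOrDefault("example.local.dir", ""); // hypothetical key
    }

    public String describe() {
        return "local:" + dir;
    }
}

class BackendSketch {
    // Stand-in for the job configuration; the values are sample data only.
    static Map<String, String> sampleConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.single.url", "http://localhost:8983/solr/");
        conf.put("example.multi.urls", "http://node1:9200,http://node2:9200");
        conf.put("example.local.dir", "/tmp/index-out");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = sampleConf();
        List<IndexBackend> backends = Arrays.asList(
                new SingleUrlBackend(), new MultiUrlBackend(), new LocalDirBackend());
        for (IndexBackend backend : backends) {
            backend.configure(conf); // the same call works for every backend
            System.out.println(backend.describe());
        }
    }
}
```

The generic job only ever hands over the configuration; it does not need to 
know how many URLs, if any, a given backend requires.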

{quote}
 the corrent command to invoke the IndexingJob command is "bin/nutch solrindex 
http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".
{quote}

As explained above, we want to keep compatibility with the existing solrindex 
command and not change its syntax. Underneath, it uses the new plugin-based 
code but sets the value of the Solr config. There is no shortcut for the 
generic indexing job command in the nutch script yet, but we could add one. For 
now it has to be called in full, e.g. bin/nutch 
org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have 
other indexing backends and not just SOLR.

Think of 'nutch solrindex' as a shortcut for the generic command.

> Pluggable indexing backends
> ---------------------------
>
>                 Key: NUTCH-1047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>              Labels: indexing
>             Fix For: 1.7
>
>         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
