[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

Julien Nioche (JIRA) Wed, 30 Jan 2013 00:41:20 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566291#comment-13566291
 ]


Julien Nioche commented on NUTCH-1047:
--------------------------------------

[~wastl-nagel] a text-based indexer is a good idea. One that generates data 
in the format used by CloudSearch (see [NUTCH-1517]) would be cool as well. As 
for your concerns: most people currently use the SOLR indexer, which will still 
be the one activated by default. I expect only a minority of people will try to 
use something else, and if they do, checking which backend is activated is no 
big deal, either via the config file or from the logs. Passing the options via 
the config with -D is not very different from using a standard parameter, with 
the added benefit that we can set things in nutch-site.xml once and for all and 
hence make the commands much simpler. As for the list of properties, they would 
vary from backend to backend anyway. Each plugin could have a README describing 
its options; compared to having everything in nutch-default.xml, at least the 
descriptions would be kept within the related plugin.
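
To illustrate the -D versus nutch-site.xml point, here is a minimal sketch of 
layered configuration, assuming the usual precedence where command-line 
overrides win over the site file, which in turn wins over the defaults. It uses 
plain maps rather than the real Hadoop Configuration class, and the property 
name example.backend.url is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LayeredConfigSketch {

    /**
     * Merges three configuration layers. Later layers override earlier ones:
     * defaults, then the site file, then -D style command-line options.
     * This is an assumed precedence for illustration only.
     */
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> site,
                                       Map<String, String> commandLine) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(site);
        merged.putAll(commandLine);
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical property name, not taken from nutch-default.xml.
        Map<String, String> defaults = Map.of("example.backend.url", "http://localhost/default");
        Map<String, String> site = Map.of("example.backend.url", "http://localhost/site");
        Map<String, String> noOverrides = Map.of();

        // Set once in the site file: the command needs no extra option.
        System.out.println(resolve(defaults, site, noOverrides).get("example.backend.url"));
    }
}
```

With the value set once in the site layer, the command line can stay empty; a 
-D option is only needed to override it for a single run.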

[~tejasp] good catch on the number of args, will fix it. Re: usage message: we 
could add a getUsage() method to each backend, which the generic command would 
call for all the active indexing plugins. I think the solrindex shortcut is 
just a temporary measure, though, until the documentation is up to scratch and 
the user base has got used to the generic commands.
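
A rough sketch of the getUsage() idea, assuming a hypothetical IndexBackend 
interface: only the method name getUsage() comes from the proposal, while the 
interface name, getName(), SimpleBackend and buildUsage() are made up for 
illustration and are not the API in the patch.

```java
import java.util.List;

public class UsageSketch {

    /** Hypothetical backend contract; only getUsage() is named in the proposal. */
    interface IndexBackend {
        String getName();
        String getUsage();
    }

    /** Trivial implementation used to demonstrate the idea. */
    static class SimpleBackend implements IndexBackend {
        private final String name;
        private final String usage;

        SimpleBackend(String name, String usage) {
            this.name = name;
            this.usage = usage;
        }

        @Override public String getName() { return name; }
        @Override public String getUsage() { return usage; }
    }

    /** Appends one usage line per active backend to the generic usage text. */
    static String buildUsage(String genericUsage, List<IndexBackend> activeBackends) {
        StringBuilder sb = new StringBuilder(genericUsage);
        for (IndexBackend backend : activeBackends) {
            sb.append('\n')
              .append("  [").append(backend.getName()).append("] ")
              .append(backend.getUsage());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The option text below is a placeholder, not a real Nutch option.
        List<IndexBackend> active = List.of(
            new SimpleBackend("backend-a", "-D example.backend.url=<url>"));
        System.out.println(buildUsage("Usage: <generic command> <args>", active));
    }
}
```

This way the generic command never needs to know any backend's options; each 
active plugin contributes its own line.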

Thanks for taking the time to share your thoughts, guys. 

> Pluggable indexing backends
> ---------------------------
>
>                 Key: NUTCH-1047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>              Labels: indexing
>             Fix For: 1.7
>
>         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
