[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429093#comment-13429093
 ] 

Ferdy Galema commented on NUTCH-1047:
-------------------------------------

Changing NutchIndexWriter into an endpoint looks like the best solution to have 
a pluggable indexing backend.

> "What is not clear yet is what IndexerOutputFormat is used for"
More or less what it is used for now: a bridge that lets mapreduce code write 
documents to index writers. What I changed in Nutch 2.x is that 
IndexerOutputFormat no longer extends FileOutputFormat, since many indexers do 
not use the filesystem at all and the temporary files that were written anyway 
are unnecessary. If a file-based implementation is needed again (like the XML 
output indexer mentioned above), it is always possible to introduce an abstract 
index writer that serves as a base for backends that use the filesystem, e.g. 
FileIndexWriter or something like that. Open for discussion.
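
To make the idea concrete, here is a minimal sketch in plain Java. All names 
in it (SketchIndexWriter, InMemoryIndexWriter, XmlFileIndexWriter, the 
Map-based document) are hypothetical stand-ins and not the actual Nutch or 
Hadoop classes. It only illustrates the split: a writer interface with no 
filesystem assumptions, plus an optional abstract FileIndexWriter base for 
backends that do write files.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer endpoint; an assumption for illustration, not the Nutch interface.
interface SketchIndexWriter {
    void open() throws IOException;
    void write(Map<String, String> doc) throws IOException;
    void close() throws IOException;
}

// A backend that never touches the filesystem (stands in for a remote index).
class InMemoryIndexWriter implements SketchIndexWriter {
    final List<Map<String, String>> docs = new ArrayList<>();

    public void open() {}

    public void write(Map<String, String> doc) {
        docs.add(new LinkedHashMap<>(doc)); // defensive copy of the document
    }

    public void close() {}
}

// Hypothetical abstract base that owns the file handling for file-backed backends.
abstract class FileIndexWriter implements SketchIndexWriter {
    private final Path output;
    private BufferedWriter out;

    protected FileIndexWriter(Path output) {
        this.output = output;
    }

    public void open() throws IOException {
        out = Files.newBufferedWriter(output, StandardCharsets.UTF_8);
    }

    public void write(Map<String, String> doc) throws IOException {
        out.write(format(doc)); // subclasses decide the on-disk representation
        out.newLine();
    }

    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }

    protected abstract String format(Map<String, String> doc);
}

// Example file-backed backend: one XML element per document, one per line.
class XmlFileIndexWriter extends FileIndexWriter {
    XmlFileIndexWriter(Path output) {
        super(output);
    }

    protected String format(Map<String, String> doc) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

With this shape the output format only needs to know the writer interface, and 
temporary files exist only when a file-backed writer is actually configured.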

One thing I noticed is that Nutch trunk still uses the old mapreduce API (see 
NUTCH-1219). It is not really a blocker, but since Nutchgora uses the new API, 
the implementations for trunk and Nutch2 will differ somewhat. For now I think 
it would be okay to ignore Nutch2 and make an implementation for trunk first. 
(I'm happy to port it to Nutch2 afterwards.)

> "whether we will be able to use implementations of NutchIndexWriter from 
> within a plugin"
What do you mean by this?
                
> Pluggable indexing backends
> ---------------------------
>
>                 Key: NUTCH-1047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>              Labels: indexing
>             Fix For: 1.6
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
