[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163481#comment-13163481
]
Julien Nioche commented on NUTCH-1047:
--------------------------------------
bq. If you'd need WARC files, for some reason, i'd rather have an endpoint for
it just like for ES and Solr instead of using WARC files as an intermediate
format.
bq. Does your suggestion imply: segment+crawldb > warc files > search engine?
Nope, let's start again :-)
We mentioned in this issue that we'd like to make the indexing backends
pluggable in order to simplify the code and make it easier for others to
implement alternative backends. We currently have only SOLR; ES is clearly a
good candidate, and you've rightly pointed out that we could have an XML dump of
the docs. I would add that we could plug in JDBC, HBase, etc. WARC is just
another example of something we could have as a plugin.
The question was: is there a functional difference between, say, [XML|WARC] and
[SOLR|ES]? For instance, the plugin endpoint for SOLR|ES would need to handle
deletions, whereas the XML or WARC one would not. Are there any more such
differences? Is it an index vs dump issue? A remote vs local one? Would it make
sense to have, on one hand, an indexer with plugins supporting deletions and
expecting a URL, and on the other, a separate job for converting segments and
crawldb to XML, WARC, etc.?
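To make the distinction concrete, here is a rough sketch of how the two kinds of endpoint could differ. All type and method names below (DocumentSink, DeletingSink, InMemoryIndexSink, XmlDumpSink) are made up for illustration and do not exist in Nutch; the in-memory map merely stands in for a remote SOLR/ES server, and the XML layout is an assumption, not a proposed format.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names throughout; none of these types exist in Nutch.

/** Anything that can receive documents: a dump format or a search index. */
interface DocumentSink {
    void write(String url, Map<String, String> fields);
}

/** A sink that is also a live index and must therefore handle deletions. */
interface DeletingSink extends DocumentSink {
    void delete(String url);
}

/** Stand-in for a SOLR/ES-style backend: keyed by URL, supports deletes. */
class InMemoryIndexSink implements DeletingSink {
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    @Override
    public void write(String url, Map<String, String> fields) {
        // Re-writing the same URL replaces the earlier document.
        docs.put(url, new LinkedHashMap<>(fields));
    }

    @Override
    public void delete(String url) {
        docs.remove(url);
    }

    public int size() {
        return docs.size();
    }
}

/** Stand-in for an XML/WARC-style dump: append-only, no notion of deletion. */
class XmlDumpSink implements DocumentSink {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void write(String url, Map<String, String> fields) {
        out.append("<doc url=\"").append(escape(url)).append("\">");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            out.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue())).append("</field>");
        }
        out.append("</doc>\n");
    }

    public String dump() {
        return out.toString();
    }

    // Minimal XML escaping; '&' must be replaced first.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

class SinkDemo {
    public static void main(String[] args) {
        Map<String, String> fields = Collections.singletonMap("title", "Home");

        InMemoryIndexSink index = new InMemoryIndexSink();
        index.write("http://example.org/", fields);
        index.delete("http://example.org/"); // only meaningful for an index
        System.out.println("docs in index: " + index.size());

        XmlDumpSink dump = new XmlDumpSink();
        dump.write("http://example.org/", fields); // a dump can only append
        System.out.print(dump.dump());
    }
}
```

In this sketch a job that needs deletions would accept only a DeletingSink, while a plain conversion job could accept any DocumentSink. Whether that split belongs in one extension point or in two separate jobs is exactly the open question.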
Does it make more sense?
> Pluggable indexing backends
> ---------------------------
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Labels: indexing
> Fix For: 1.5
>
>
> One possible feature would be to add a new endpoint for indexing-backends and
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is
> OK - but as other resources like ElasticSearch are becoming more popular it
> would be better to handle this as plugins. Not sure about the name of the
> endpoint though : we already have indexing-plugins (which are about
> generating fields sent to the backends) and moreover the backends are not
> necessarily for indexing / searching but could be just an external storage
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this
> could be pertaining to the storage in GORA. 'indexing-backend' is the best
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating
> and cleaning and maybe add a Nutch extension point there so we can easily
> hook up indexing, cleaning and deduplicating for various backends.