[
https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429058#comment-13429058
]
Julien Nioche commented on NUTCH-1445:
--------------------------------------
Ferdy - just to reiterate what was said on a previous issue: please give
people time to review your contribs before committing your own stuff. I am sure
your code is fine and it does not really affect existing code too much, but I
think it is a good practice that we should try and stick to.

Instead of having multiple commands for the indexing backends, can't we have a
single job and define the backends (SOLR, ES) via configuration? There is
an open issue on 'pluggable indexing backends'
[https://issues.apache.org/jira/browse/NUTCH-1047]; can we discuss this there?
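
To make the "single job, backends via configuration" idea concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the IndexBackend interface, the RecordingBackend and BackendSelector classes, and the property name "indexer.backends" are hypothetical and are not taken from Nutch or from NUTCH-1047.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over an indexing backend; not an actual Nutch API.
interface IndexBackend {
    void write(Map<String, String> doc);
}

// Stand-in backend that only records documents, so the sketch runs without a server.
class RecordingBackend implements IndexBackend {
    final String name;
    final List<Map<String, String>> written = new ArrayList<>();

    RecordingBackend(String name) {
        this.name = name;
    }

    @Override
    public void write(Map<String, String> doc) {
        written.add(doc);
    }
}

class BackendSelector {
    // "indexer.backends" is a made-up property name: a comma-separated list of backends.
    static List<IndexBackend> fromConfig(Map<String, String> conf) {
        String value = conf.getOrDefault("indexer.backends", "solr");
        List<IndexBackend> backends = new ArrayList<>();
        for (String raw : value.split(",")) {
            String name = raw.trim().toLowerCase();
            switch (name) {
                case "solr":
                case "elastic":
                    // A real job would construct a Solr or elasticsearch writer here.
                    backends.add(new RecordingBackend(name));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown indexing backend: " + name);
            }
        }
        return backends;
    }
}
```

With something along these lines, one indexing job could loop over the configured backends and call write on each, instead of shipping a separate command per backend.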
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
> Key: NUTCH-1445
> URL: https://issues.apache.org/jira/browse/NUTCH-1445
> Project: Nutch
> Issue Type: New Feature
> Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: NUTCH-1445-addPropsToConfig.patch,
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to
> elasticsearch. It is originally based upon
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license),
> but we have modified it greatly to make it integrate as well as possible into
> Nutch. The greatest modification is that documents are asynchronously flushed
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome.
> You simply deploy a server by unpacking the tar, configure the clustername,
> start the server and fire away indexing requests. Indices are automatically
> created. Fields are automapped. (Of course it is recommended to create your
> own optimized mapping, but that is beyond the scope of this issue.) Multiple
> servers connect without extra configuration, simply by using the same
> clustername (by means of multicast). There are tons of advanced options, such
> as sharding, replication, disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index
> over 1M docs (average sized web documents) per minute. The best part is that
> the added documents are almost instantly searchable, so there are none of the
> hidden commit costs that Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
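
The asynchronous bulk flushing mentioned in the description can be illustrated with a small Java sketch. This is not the code from the attached patch: AsyncBulkBuffer, maxBulkDocs and the sender callback are hypothetical names, and the sender merely stands in for whatever call submits a bulk request to elasticsearch. The sketch assumes at most one bulk is in flight at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical buffer: collects documents and hands full batches to a sender
// on a background thread, so the caller keeps producing while a bulk is sent.
class AsyncBulkBuffer {
    private final int maxBulkDocs;
    private final Consumer<List<String>> sender; // stands in for a bulk request
    private List<String> pending = new ArrayList<>();
    private CompletableFuture<Void> inFlight = CompletableFuture.completedFuture(null);

    AsyncBulkBuffer(int maxBulkDocs, Consumer<List<String>> sender) {
        this.maxBulkDocs = maxBulkDocs;
        this.sender = sender;
    }

    // Buffer one document; trigger a flush once the batch is full.
    void add(String doc) {
        pending.add(doc);
        if (pending.size() >= maxBulkDocs) {
            flush();
        }
    }

    // Send the buffered documents asynchronously as one bulk.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        inFlight.join(); // wait for the previous bulk; join rethrows its failure, if any
        List<String> batch = pending;
        pending = new ArrayList<>();
        inFlight = CompletableFuture.runAsync(() -> sender.accept(batch));
    }

    // Flush the remainder and block until the last bulk has completed.
    void close() {
        flush();
        inFlight.join();
    }
}
```

Waiting for the previous bulk before submitting the next keeps memory bounded to roughly two batches while still overlapping document production with the network round trip.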