[
https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445850#comment-13445850
]
Ferdy Galema commented on NUTCH-1445:
-------------------------------------
("feature requests" should be "future requests", of course.)
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
> Key: NUTCH-1445
> URL: https://issues.apache.org/jira/browse/NUTCH-1445
> Project: Nutch
> Issue Type: New Feature
> Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: NUTCH-1445-addPropsToConfig.patch,
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to
> elasticsearch. It is originally based upon
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license),
> but we have modified it greatly to make it integrate as well as possible into
> Nutch. The greatest modification is that documents are asynchronously flushed
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome.
> You simply deploy a server by unpacking the tar, configure the clustername,
> start the server and fire away indexing requests. Indices are automatically
> created. Fields are automapped. (Of course it is recommended to create your
> own optimized mapping, but that is beyond the scope of this issue.) Multiple
> servers connect without extra configuration, simply by using the same
> clustername (by means of multicast). There are tons of advanced options, such
> as sharding, replication, disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index
> over 1M docs (average sized webdocuments) per minute. The best part is that
> the added documents are almost instantly searchable, so there are none of the
> hidden commit costs that Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
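For readers who want a picture of what "asynchronously flushed in bulk" means in the description above, here is a minimal sketch in plain Java. It is not the code from the attached patch and it does not use the elasticsearch client: the BulkBuffer class, the maxBulkSize parameter and the sink callback are all hypothetical names, and the sink merely stands in for whatever sends one bulk request. The assumption made here is that at most one bulk is in flight at a time, so the producer only blocks when the previous bulk has not finished yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical illustration only; not the ElasticIndexerJob from the patch. */
class BulkBuffer<T> implements AutoCloseable {
    private final int maxBulkSize;
    // Stand-in for "send one bulk request to the index" (an assumption).
    private final Consumer<List<T>> sink;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private List<T> pending = new ArrayList<>();
    private CompletableFuture<Void> inFlight = CompletableFuture.completedFuture(null);

    BulkBuffer(int maxBulkSize, Consumer<List<T>> sink) {
        this.maxBulkSize = maxBulkSize;
        this.sink = sink;
    }

    /** Buffers one document and hands off a batch once it is full. */
    synchronized void add(T doc) {
        pending.add(doc);
        if (pending.size() >= maxBulkSize) {
            flush();
        }
    }

    /** Sends the current batch on the background thread. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Wait for the previous bulk so that only one is in flight at a time.
        inFlight.join();
        List<T> batch = pending;
        pending = new ArrayList<>();
        inFlight = CompletableFuture.runAsync(() -> sink.accept(batch), executor);
    }

    /** Flushes the remainder and waits until the last bulk has completed. */
    @Override
    public synchronized void close() {
        flush();
        inFlight.join();
        executor.shutdown();
    }

    public static void main(String[] args) {
        // Each printed line represents one bulk handed to the sink.
        try (BulkBuffer<String> buffer = new BulkBuffer<String>(2, System.out::println)) {
            buffer.add("doc-1");
            buffer.add("doc-2");
            buffer.add("doc-3");
        }
    }
}
```

The caller keeps adding documents while the previous batch is being sent, and close() guarantees that the final partial batch is flushed before the job ends. A real indexer would also need error handling for failed bulks, which this sketch leaves out.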
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira