Ferdy Galema created NUTCH-1445:
-----------------------------------

             Summary: Add ElasticIndexerJob that indexes to elasticsearch
                 Key: NUTCH-1445
                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
             Project: Nutch
          Issue Type: New Feature
            Reporter: Ferdy Galema
             Fix For: 2.1


We have created a new indexer job, ElasticIndexerJob, that indexes to 
elasticsearch. It is originally based upon 
https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), but 
we have modified it greatly to make it integrate as well as possible into 
Nutch. The greatest modification is that documents are asynchronously flushed 
in bulk to elasticsearch.
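
To illustrate what "asynchronously flushed in bulk" means, here is a minimal 
sketch of the general pattern. It is not the code from the patch: the class 
name BulkBuffer, the maxDocs threshold and the sink callback are all 
hypothetical, and the sink merely stands in for whatever sends a bulk request 
to the elasticsearch server. Documents are plain strings to keep it 
self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical sketch: buffers documents and flushes them in bulk in the background. */
class BulkBuffer {
    private final int maxDocs;                  // assumed flush threshold, not a real Nutch property
    private final Consumer<List<String>> sink;  // stand-in for sending one bulk request
    private List<String> pending = new ArrayList<>();
    private CompletableFuture<Void> inFlight = CompletableFuture.completedFuture(null);

    BulkBuffer(int maxDocs, Consumer<List<String>> sink) {
        this.maxDocs = maxDocs;
        this.sink = sink;
    }

    /** Queue one document; starts an asynchronous flush once the buffer is full. */
    synchronized void add(String doc) {
        pending.add(doc);
        if (pending.size() >= maxDocs) {
            flush();
        }
    }

    /** Hand the current batch to a background task and continue with an empty buffer. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> batch = pending;
        pending = new ArrayList<>();
        // Chain onto the previous flush so batches are sent one at a time, in order.
        inFlight = inFlight.thenRunAsync(() -> sink.accept(batch));
    }

    /** Flush the remainder and block until every outstanding batch has been sent. */
    void close() {
        CompletableFuture<Void> last;
        synchronized (this) {
            flush();
            last = inFlight;
        }
        last.join();
    }
}
```

The caller keeps adding documents while earlier batches are still being sent 
and only waits in close(), at the end of the job. A real indexer would also 
need error handling and probably a limit on the batch size in bytes, both of 
which this sketch leaves out.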

Elasticsearch rocks. Both performance and ease of configuration are awesome. You 
simply deploy a server by unpacking the tar, configure the clustername, start 
the server and fire away indexing requests. Indices are automatically created. 
Fields are automapped. (Of course it is recommended to create your own 
optimized mapping, but that is beyond the scope of this issue.) Multiple servers 
connect without extra configuration, simply by using the same clustername (by 
means of multicast). There are tons of advanced options, such as sharding, 
replication, disk striping etc.

To give an example of the performance: With 20+ nodes we are able to index over 
1M docs (average-sized web documents) per minute. The best part is that the 
added documents are almost instantly searchable, so there are no hidden commit 
costs like those Solr has. This is with out-of-the-box configuration.

(I will attach a patch and commit it for Nutch2. Feel free to adapt it for trunk.)



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
