[ 
https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429067#comment-13429067
 ] 

Ferdy Galema commented on NUTCH-1445:
-------------------------------------

Hi Julien,

Agreed to wait a while before committing next time. In general I do wait for 
feedback on big patches, however I thought we should be able to make an 
exception for changes that only add new functionality. Of course only if it has 
been extensively tested and the chance of breaking stuff or changing existing 
behaviour is really minimal. Users who actively use the current branch will be 
able to use these features sooner.

On topic: I think pluggable backends are a great idea. We can indeed discuss 
the details in NUTCH-1047. The addition of this elasticsearch indexer should 
make NUTCH-1047 a bit more interesting to implement, since there are now 
actually multiple backends.
                
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
>                 Key: NUTCH-1445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1445-addPropsToConfig.patch, 
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to 
> elasticsearch. It is originally based upon 
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), 
> but we have modified it greatly to make it integrate as well as possible into 
> Nutch. The greatest modification is that documents are asynchronously flushed 
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome. 
> You simply deploy a server by unpacking the tar, configure the clustername, 
> start the server and fire away indexing requests. Indices are automatically 
> created. Fields are automapped. (Of course it is recommended to create your 
> own optimized mapping, but that is beyond the scope of this issue.) Multiple 
> servers connect without extra configuration, simply by using the same 
> clustername (by means of multicast). There are tons of advanced options, such 
> as sharding, replication, disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index 
> over 1M docs (average sized web documents) per minute. The best part is that 
> the added documents are almost instantly searchable, so there are none of the 
> hidden commit costs that Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
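
For readers unfamiliar with the "asynchronously flushed in bulk" idea mentioned in the description, here is a minimal sketch of the general pattern. It is not the code from the patch: the class name AsyncBulkBuffer, the maxBulkDocs parameter and the Consumer-based sink are all hypothetical stand-ins, and the sink merely takes the place of whatever bulk request the real job sends to elasticsearch. It assumes at most one bulk in flight at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical sketch: buffers documents and hands them off in bulk on a background thread. */
class AsyncBulkBuffer implements AutoCloseable {
    private final int maxBulkDocs;                 // assumed threshold, not a Nutch property
    private final Consumer<List<String>> sink;     // stand-in for a bulk request to the server
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private List<String> pending = new ArrayList<>();
    private CompletableFuture<Void> inFlight = CompletableFuture.completedFuture(null);

    AsyncBulkBuffer(int maxBulkDocs, Consumer<List<String>> sink) {
        this.maxBulkDocs = maxBulkDocs;
        this.sink = sink;
    }

    /** Adds one document; triggers a flush once the buffer is full. */
    synchronized void add(String doc) {
        pending.add(doc);
        if (pending.size() >= maxBulkDocs) {
            flush();
        }
    }

    /** Hands the current batch to the background thread and returns immediately. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        inFlight.join();                           // wait for the previous bulk, if still running
        List<String> batch = pending;
        pending = new ArrayList<>();
        inFlight = CompletableFuture.runAsync(() -> sink.accept(batch), executor);
    }

    /** Flushes the remainder, waits for it to finish, then stops the worker thread. */
    @Override
    public synchronized void close() {
        flush();
        inFlight.join();
        executor.shutdown();
    }

    public static void main(String[] args) {
        try (AsyncBulkBuffer buffer =
                 new AsyncBulkBuffer(2, batch -> System.out.println("bulk of " + batch.size()))) {
            for (int i = 0; i < 5; i++) {
                buffer.add("doc" + i);
            }
        }
    }
}
```

The point of the design is that the caller keeps producing documents while the previous bulk is still being sent; it only blocks when a second bulk is ready before the first has completed.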

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
