[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

Ferdy Galema (JIRA) Fri, 31 Aug 2012 04:33:14 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445849#comment-13445849 ]


Ferdy Galema commented on NUTCH-1445:
-------------------------------------

Hi Matt,

Sure, we can resolve your issue here. But for feature requests I think it's 
best to use the mailing list first. (From there it is always possible to refer 
to Jira issues, create a new one, etc.)

Looking at your logs, my first bet is that you are confusing 'cluster name' 
with 'node name'. I see that the indexing code is unable to contact a master: 
"waited for 30s and no initial state was set".

When you deploy Elasticsearch, you normally do not need to set node names, 
just the cluster name. (Node names are made up on the fly.)

I see that your master is called 'Doppleganger', and I guess 
'elasticsearch_matt' is your cluster name. Use that as the argument for the 
indexer.
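
To make the distinction concrete, here is a minimal sketch of the relevant 
part of elasticsearch.yml, assuming a default install and taking the two names 
from your logs. Only cluster.name matters to the indexer; node.name is shown 
commented out because it is normally left unset:

```yaml
# Shared by every node in the cluster; a client must use this value to join it.
cluster.name: elasticsearch_matt

# Optional label for a single node; if unset, Elasticsearch picks one at startup.
# node.name: "Doppleganger"
```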

Ferdy.


                
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
>                 Key: NUTCH-1445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1445-addPropsToConfig.patch, 
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to 
> elasticsearch. It is originally based upon 
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), 
> but we have modified it greatly to make it integrate as well as possible into 
> Nutch. The greatest modification is that documents are asynchronously flushed 
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome. 
> You simply deploy a server by unpacking the tar, configure the clustername, 
> start the server and fire away indexing requests. Indices are automatically 
> created. Fields are automapped. (Of course it is recommended to create your 
> own optimized mapping, but that is beyond the scope of this issue). Multiple 
> servers connect without extra configuration, simply by using the same 
> clustername. (By means of multicast). There are tons of advanced options, 
> such as sharding, replication, disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index 
> over 1M docs (average-sized web documents) per minute. The best part is that 
> the added documents are almost instantly searchable, so there are no hidden 
> commit costs like Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
