[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

Matt MacDonald (JIRA) Fri, 31 Aug 2012 05:00:12 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445860#comment-13445860 ]


Matt MacDonald commented on NUTCH-1445:
---------------------------------------

Ferdy,

Thanks for the help. I'll definitely use the mailing list first for this type 
of question/issue in the future.

I was confusing 'cluster name' with 'node name' when invoking the elasticindex 
command, but I am still seeing the same issue after making your suggested 
change. Now that I am using the proper cluster name, I can see a client node 
(presumably the one started by the ElasticIndexerJob) join the ElasticSearch 
cluster during the job, but it is removed again when the error is encountered:

*Client node added to and removed from ElasticSearch*
{noformat}
[2012-08-31 07:38:59,073][INFO ][cluster.service          ] [Doorman] added {[Atleza][KBhEZMZEQqmoSALKkYLprw][inet[/192.168.1.133:9302]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Doppleganger][OF5TWSbpTl64qA0_VW-b_g][inet[/192.168.1.133:9300]]])
[2012-08-31 07:39:01,140][INFO ][cluster.service          ] [Doorman] removed {[Atleza][KBhEZMZEQqmoSALKkYLprw][inet[/192.168.1.133:9302]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Doppleganger][OF5TWSbpTl64qA0_VW-b_g][inet[/192.168.1.133:9300]]])
{noformat}

*Still seeing the same error message*
{noformat}
2012-08-31 07:38:59,990 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: type is missing;
{noformat}
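
The message suggests that an index request was submitted without a document type set. As a rough illustration only, the sketch below shows the kind of check that could produce a message in this shape. The class `IndexRequestCheck`, its `validate` method, and the sample values "nutch" and "webpage" are hypothetical stand-ins, not Elasticsearch's or Nutch's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not Elasticsearch's own classes.
class IndexRequestCheck {

    // Returns null when both values are present, otherwise a message that
    // lists every missing field in the numbered style seen in the log above.
    static String validate(String index, String type) {
        List<String> errors = new ArrayList<>();
        if (index == null) {
            errors.add("index is missing");
        }
        if (type == null) {
            errors.add("type is missing");
        }
        if (errors.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder("Validation Failed: ");
        for (int i = 0; i < errors.size(); i++) {
            sb.append(i + 1).append(": ").append(errors.get(i)).append(";");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An index name without a document type yields the "type is missing" message.
        System.out.println(validate("nutch", null));
        // With both values set there is nothing to report, so this prints "null".
        System.out.println(validate("nutch", "webpage"));
    }
}
```

If this reading is right, the thing to check is whether the indexer is handed a non-null document type when it builds its requests.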

Thanks,
Matt
                
> Add ElasticIndexerJob that indexes to elasticsearch
> ---------------------------------------------------
>
>                 Key: NUTCH-1445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1445
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1445-addPropsToConfig.patch, 
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job, ElasticIndexerJob, that indexes to 
> elasticsearch. It was originally based on 
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), 
> but we have modified it heavily to integrate it as well as possible into 
> Nutch. The biggest modification is that documents are flushed asynchronously 
> in bulk to elasticsearch.
> Elasticsearch rocks. Both the performance and the ease of configuration are 
> awesome. You simply deploy a server by unpacking the tar, configure the 
> cluster name, start the server and fire away indexing requests. Indices are 
> created automatically. Fields are automapped. (Of course it is recommended 
> to create your own optimized mapping, but that is beyond the scope of this 
> issue.) Multiple servers connect without extra configuration, simply by 
> using the same cluster name (by means of multicast). There are tons of 
> advanced options, such as sharding, replication, disk striping etc.
> To give an example of the performance: with 20+ nodes we are able to index 
> over 1M docs (average-sized web documents) per minute. The best part is that 
> the added documents are almost instantly searchable, so there are none of 
> the hidden commit costs that Solr has. This is with out-of-the-box 
> configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
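
The asynchronous bulk flushing mentioned in the description can be pictured roughly as follows. This is only a sketch of the general technique, not the code from the attached patches: the names `BulkBuffer` and `Sink` are hypothetical, documents are plain strings, and the sink stands in for whatever actually sends a batch to elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical buffer that collects documents and hands full batches to a
// background thread, so the caller keeps adding while a batch is in flight.
class BulkBuffer {

    // Hypothetical callback that delivers one batch (e.g. one bulk request).
    interface Sink {
        void send(List<String> batch);
    }

    private final int maxDocs;
    private final Sink sink;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private List<String> pending = new ArrayList<>();
    private Future<?> inFlight;

    BulkBuffer(int maxDocs, Sink sink) {
        this.maxDocs = maxDocs;
        this.sink = sink;
    }

    // Buffers a document and triggers a flush once the batch is full.
    synchronized void add(String doc) {
        pending.add(doc);
        if (pending.size() >= maxDocs) {
            flush();
        }
    }

    // Waits for the previous batch (at most one is in flight at a time),
    // then submits the pending documents without waiting for them.
    synchronized void flush() {
        awaitInFlight();
        if (pending.isEmpty()) {
            return;
        }
        final List<String> batch = pending;
        pending = new ArrayList<>();
        inFlight = executor.submit(() -> sink.send(batch));
    }

    // Flushes what is left, waits for it to complete, and stops the worker.
    synchronized void close() {
        flush();
        awaitInFlight();
        executor.shutdown();
    }

    private void awaitInFlight() {
        if (inFlight == null) {
            return;
        }
        try {
            inFlight.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            inFlight = null;
        }
    }

    public static void main(String[] args) {
        BulkBuffer buffer = new BulkBuffer(2, batch -> System.out.println("flushed " + batch));
        buffer.add("doc1");
        buffer.add("doc2"); // second document fills the batch and triggers a flush
        buffer.add("doc3");
        buffer.close();     // flushes the remaining document
    }
}
```

Allowing only one batch in flight keeps memory bounded while still overlapping document preparation with the network round trip. A real indexer would also need to inspect per-document failures in each bulk response.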

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
