Joseph Naegele created NUTCH-2287:
-------------------------------------
Summary: Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
Key: NUTCH-2287
URL: https://issues.apache.org/jira/browse/NUTCH-2287
Project: Nutch
Issue Type: Improvement
Components: indexer, plugin
Affects Versions: 1.12
Reporter: Joseph Naegele
Elasticsearch's Java API (since at least v2.0) includes the {{BulkProcessor}},
which automatically flushes bulk requests once a maximum document count and/or
maximum bulk size is reached. It also now (I believe since 2.2.0) offers a
{{BackoffPolicy}} option, which lets the {{BulkProcessor}} retry bulk requests
when the Elasticsearch cluster is saturated. Using the {{BulkProcessor}} was
originally suggested
[here|https://issues.apache.org/jira/browse/NUTCH-1527?focusedCommentId=13666616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13666616].
Refactoring the {{indexer-elastic}} plugin to use the {{BulkProcessor}} will
greatly simplify the existing plugin, at the cost of slightly less debug
logging. It will also let the plugin handle cluster saturation gracefully by
retrying with a configurable exponential back-off policy, rather than throwing
a {{RuntimeException}} and killing the reduce task.
https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.3/java-docs-bulk-processor.html
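To make the proposed behaviour concrete, here is a minimal stand-alone sketch of the two ideas combined: buffer documents, flush when a maximum count is reached, and retry a rejected bulk after increasing delays instead of failing immediately. It deliberately does not use the Elasticsearch client. Every name in it ({{BulkBackoffSketch}}, {{Buffer}}, {{doublingDelays}}, the {{sender}} callback) is hypothetical, and the simple doubling schedule is an assumption for illustration, not necessarily the schedule Elasticsearch's {{BackoffPolicy}} produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: none of these names come from Nutch or Elasticsearch. */
final class BulkBackoffSketch {

    /** Retry delays in milliseconds for a simple doubling schedule: d, 2d, 4d, ... */
    static List<Long> doublingDelays(long initialMillis, int maxRetries) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 0; i < maxRetries; i++) {
            delays.add(delay);
            delay *= 2;
        }
        return delays;
    }

    /** Buffers documents and flushes automatically once maxActions is reached. */
    static final class Buffer {
        private final int maxActions;
        private final List<Long> retryDelays;
        // Hypothetical transport: returns false when the cluster rejects the bulk.
        private final Predicate<List<String>> sender;
        private final List<String> pending = new ArrayList<>();
        int flushes = 0;
        int retries = 0;

        Buffer(int maxActions, List<Long> retryDelays, Predicate<List<String>> sender) {
            this.maxActions = maxActions;
            this.retryDelays = retryDelays;
            this.sender = sender;
        }

        void add(String doc) {
            pending.add(doc);
            if (pending.size() >= maxActions) {
                flush();
            }
        }

        void flush() {
            if (pending.isEmpty()) {
                return;
            }
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            boolean accepted = sender.test(batch);
            // Retry with growing delays; give up only after the schedule is exhausted.
            for (long delay : retryDelays) {
                if (accepted) {
                    break;
                }
                sleep(delay);
                retries++;
                accepted = sender.test(batch);
            }
            if (!accepted) {
                throw new IllegalStateException(
                        "bulk rejected after " + retryDelays.size() + " retries");
            }
            flushes++;
        }

        private static void sleep(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while backing off", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated saturated cluster: rejects the first two attempts, then accepts.
        Buffer buffer = new Buffer(2, doublingDelays(1, 3), batch -> ++attempts[0] > 2);
        buffer.add("doc-1");
        buffer.add("doc-2"); // reaches maxActions and triggers the flush
        System.out.println("flushes=" + buffer.flushes + " retries=" + buffer.retries);
        // prints "flushes=1 retries=2"
    }
}
```

In the actual plugin, this bookkeeping would presumably be delegated to the client library. The linked 2.3 documentation configures it through {{BulkProcessor.builder(...)}} with {{setBulkActions}}, {{setBulkSize}} and {{setBackoffPolicy(BackoffPolicy.exponentialBackoff(...))}}, so the plugin would mainly need to expose those values as configuration properties.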