[
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487085#comment-16487085
]
Tim Robertson edited comment on BEAM-4389 at 5/23/18 11:37 AM:
---------------------------------------------------------------
Thanks for the quick reply [~echauchot]
The {{withUsePartialUpdate(true)}} would simply change the {{bulk}} list sent
to ES to have {{update}} instead of {{index}} operations. Server-side,
Elasticsearch treats this as a "get document, apply edits, save document"
operation.
In our code I think it would be something as simple as exposing the
configuration toggle and changing:
{code}
batch.add(String.format("{ \"index\" : %s }%n%s%n", documentAddress,
document));
{code}
to
{code}
String operation = spec.isPartialUpdate() ? "update" : "index";
batch.add(String.format("{ \"%s\" : %s }%n%s%n", operation, documentAddress,
document));
{code}
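One detail the snippet above glosses over: in the Elasticsearch bulk API, the source line of an {{update}} action carries the partial document under a {{doc}} key rather than as the bare document. A minimal, self-contained sketch of how the two line pairs could be built follows. The class {{BulkEntrySketch}}, the method {{bulkEntry}} and the boolean parameter (standing in for {{spec.isPartialUpdate()}}) are hypothetical names for illustration, not existing {{ElasticsearchIO}} code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ElasticsearchIO.
final class BulkEntrySketch {

  /**
   * Builds the action line and source line for one bulk entry.
   *
   * @param documentAddress action metadata as JSON, e.g. { "_id" : "1" }
   * @param document the full or partial document as JSON
   * @param partialUpdate stand-in for the proposed spec.isPartialUpdate() toggle
   */
  static String bulkEntry(String documentAddress, String document, boolean partialUpdate) {
    if (partialUpdate) {
      // An "update" action expects the partial document wrapped under "doc".
      return String.format(
          "{ \"update\" : %s }%n{ \"doc\" : %s }%n", documentAddress, document);
    }
    // An "index" action takes the document as-is on the source line.
    return String.format("{ \"index\" : %s }%n%s%n", documentAddress, document);
  }

  public static void main(String[] args) {
    List<String> batch = new ArrayList<>();
    batch.add(bulkEntry("{ \"_id\" : \"1\" }", "{ \"country\" : \"DK\" }", true));
    batch.add(bulkEntry("{ \"_id\" : \"2\" }", "{ \"country\" : \"FR\" }", false));
    batch.forEach(System.out::print);
  }
}
```

Whether missing documents should be created on update (for example via {{doc_as_upsert}}) is a separate design choice that this sketch leaves out.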
Introducing new fields and schema compatibility seem no different from the
current model (you can already push nonsense JSON to a live Elasticsearch today).
Or am I overlooking something?
Edited to add: I'd probably include a {{"_retry_on_conflict" : 5}} or similar
for the updates as well
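As a rough illustration of the retry setting mentioned above, the action line for an update could carry it alongside the document ID. This is only a sketch: {{UpdateActionSketch}} and {{updateAction}} are hypothetical names, the key spelling is taken from this comment (some Elasticsearch versions spell it without the leading underscore, so check the bulk API documentation for the target version), and the ID is assumed not to need JSON escaping.

```java
// Hypothetical helper for illustration only; not part of ElasticsearchIO.
final class UpdateActionSketch {

  /**
   * Builds the action line of a bulk "update" entry with a retry count.
   * Assumes the id contains no characters that need JSON escaping.
   */
  static String updateAction(String id, int retries) {
    // Key name as written in the comment; the spelling may vary by ES version.
    return String.format(
        "{ \"update\" : { \"_id\" : \"%s\", \"_retry_on_conflict\" : %d } }", id, retries);
  }

  public static void main(String[] args) {
    System.out.println(updateAction("42", 5));
  }
}
```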
> Enable partial updates for Elasticsearch
> ----------------------------------------
>
> Key: BEAM-4389
> URL: https://issues.apache.org/jira/browse/BEAM-4389
> Project: Beam
> Issue Type: New Feature
> Components: io-java-elasticsearch
> Affects Versions: 2.4.0
> Reporter: Tim Robertson
> Assignee: Tim Robertson
> Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial
> updates rather than full document inserts.
> Rationale: We have the case where different pipelines process different
> categories of information of the target entity (e.g. one for taxonomic
> processing, another for geospatial processing). A read and merge is not
> possible inside the batch call, meaning the only way to do it is through a
> join. The join approach is slow, and also stops the ability to run a single
> process in isolation (e.g. reprocess the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUsePartialUpdate(true)}} such as:
> {code}
> source.apply(
> ElasticsearchIO.write()
> .withConnectionConfiguration(connectionConfiguration)
> .withIdFn(new ExtractValueFn("id"))
> .withUsePartialUpdate(true));
> {code}