[ 
https://issues.apache.org/jira/browse/BEAM-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-4389:
--------------------------------
    Description: 
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have a case where different pipelines process different 
categories of information about the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, so the only way to combine them is through a 
join. The join approach is slow and also prevents running a single 
process in isolation (e.g. reprocessing the geospatial component of all docs).

This configuration parameter only makes sense when used in conjunction with 
controlling the document ID (possible since BEAM-3201).

The client API would include a {{withUpdateMode(...)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
    .withConnectionConfiguration(connectionConfiguration)
    .withIdFn(new ExtractValueFn("id"))
    .withUpdateMode(UpdateMode.PARTIAL));
{code}
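
For context, here is a minimal sketch of how the two modes might differ at the level of the Elasticsearch bulk API: a full insert is sent as an {{index}} action followed by the document, while a partial update is sent as an {{update}} action whose body wraps the changed fields in {{doc}}. The class {{BulkActionSketch}}, its {{bulkEntry}} helper and the {{UpdateMode}} values are hypothetical and only illustrate the request shape; this is not the {{ElasticsearchIO}} implementation. Setting {{doc_as_upsert}} is an assumption about how an upsert could be requested when the document may not exist yet.

```java
// Hypothetical helper: builds the newline-delimited JSON lines that a
// _bulk request would carry for a single document. Not part of Beam.
final class BulkActionSketch {

  // Assumed to mirror the proposed option; names are illustrative only.
  enum UpdateMode { INDEX, PARTIAL }

  /**
   * Returns the action line and the body line for one document.
   * docJson is assumed to be a valid JSON object supplied by the caller.
   */
  static String bulkEntry(UpdateMode mode, String id, String docJson, boolean upsert) {
    // Escape backslashes and quotes so the ID stays a valid JSON string.
    String escapedId = id.replace("\\", "\\\\").replace("\"", "\\\"");

    if (mode == UpdateMode.PARTIAL) {
      // Partial update: only the fields inside "doc" are merged into the
      // existing document; other fields are left untouched.
      String action = "{\"update\":{\"_id\":\"" + escapedId + "\"}}";
      String body = upsert
          ? "{\"doc\":" + docJson + ",\"doc_as_upsert\":true}"
          : "{\"doc\":" + docJson + "}";
      return action + "\n" + body + "\n";
    }

    // Full insert: the document replaces whatever is stored under the ID.
    return "{\"index\":{\"_id\":\"" + escapedId + "\"}}\n" + docJson + "\n";
  }
}
```

With this shape, a geospatial pipeline could send only its own fields for a given ID without overwriting what a taxonomic pipeline wrote earlier, which is why controlling the document ID is a precondition.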



  was:
Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
updates rather than full document inserts. 

Rationale: We have the case where different pipelines process different 
categories of information of the target entity (e.g. one for taxonomic 
processing, another for geospatial processing). A read and merge is not 
possible inside the batch call, meaning the only way to do it is through a 
join. The join approach is slow, and also stops the ability to run a single 
process in isolation (e.g. reprocess the geospatial component of all docs).

Use of this configuration parameter has to be used in conjunction with 
controlling the document ID (possible since BEAM-3201) to make sense.

The client API would include a {{withUsePartialUpdate(true)}} such as:

{code}
source.apply(
  ElasticsearchIO.write()
    .withConnectionConfiguration(connectionConfiguration)
    .withIdFn(new ExtractValueFn("id"))
    .withUpdateMode(UpdateMode.PARTIAL)
{code}




> Enable updates and upserts for Elasticsearch
> --------------------------------------------
>
>                 Key: BEAM-4389
>                 URL: https://issues.apache.org/jira/browse/BEAM-4389
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>    Affects Versions: 2.4.0
>            Reporter: Tim Robertson
>            Assignee: Tim Robertson
>            Priority: Major
>
> Expose a configuration option on the {{ElasticsearchIO}} to enable partial 
> updates rather than full document inserts. 
> Rationale: We have a case where different pipelines process different 
> categories of information about the target entity (e.g. one for taxonomic 
> processing, another for geospatial processing). A read and merge is not 
> possible inside the batch call, so the only way to combine them is through a 
> join. The join approach is slow and also prevents running a single 
> process in isolation (e.g. reprocessing the geospatial component of all docs).
> This configuration parameter only makes sense when used in conjunction with 
> controlling the document ID (possible since BEAM-3201).
> The client API would include a {{withUpdateMode(...)}} such as:
> {code}
> source.apply(
>   ElasticsearchIO.write()
>     .withConnectionConfiguration(connectionConfiguration)
>     .withIdFn(new ExtractValueFn("id"))
>     .withUpdateMode(UpdateMode.PARTIAL));
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
