[ 
https://issues.apache.org/jira/browse/BEAM-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Galpin updated BEAM-12093:
-------------------------------
    Fix Version/s: 2.31.0

> Overhaul ElasticsearchIO#Write
> ------------------------------
>
>                 Key: BEAM-12093
>                 URL: https://issues.apache.org/jira/browse/BEAM-12093
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Evan Galpin
>            Assignee: Evan Galpin
>            Priority: P2
>              Labels: elasticsearch
>             Fix For: 2.31.0
>
>          Time Spent: 16h 20m
>  Remaining Estimate: 0h
>
> The current ElasticsearchIO#Write is great, but there are two related areas 
> which could be improved:
>  # Separation of concerns
>  # Bulk API batch size optimization
>  
> Presently, the Write transform has 2 responsibilities which are coupled and 
> cannot be separated by users:
>  # Convert input documents into Bulk API entities, serializing based on user 
> settings (partial update, delete, upsert, etc)
>  # Batch the converted Bulk API entities together and interface with the 
> target ES cluster
>  
> Having these 2 roles tightly coupled means testing requires an available 
> Elasticsearch cluster, making unit testing almost impossible. Allowing access 
> to the serialized documents would make unit testing much easier for pipeline 
> developers, among numerous other benefits to having separation between 
> serialization and IO.
> Relatedly, the batching of entities when creating Bulk API payloads is 
> currently limited by the lesser of Beam Runner bundling semantics and the 
> `ElasticsearchIO#Write#maxBatchSize` setting. This is understandable for 
> portability between runners, but it also means most Bulk payloads contain 
> only a few (1-5) entities. By using Stateful Processing to better adhere to 
> the `ElasticsearchIO#Write#maxBatchSize` setting, we have been able to 
> reduce the number of indexing requests in an Elasticsearch cluster by 
> 50-100x. Separating the roles of document serialization and IO allows 
> supporting multiple IO techniques with minimal and understandable code.
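
The two roles described above can be sketched in plain Java, without any Beam or Elasticsearch dependency. This is only an illustration under stated assumptions, not the ElasticsearchIO implementation: the class `BulkSketch` and the methods `toIndexEntity` and `toBulkPayloads` are hypothetical names, the batching is a simple in-memory loop rather than Beam Stateful Processing, and the id is assumed to need no JSON escaping. It shows why separating the steps helps: the serialization step can be checked with no cluster available, and the batching step controls how many entities go into each Bulk request.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkSketch {

  // Role 1 (serialization only): turn one JSON document into a Bulk API
  // "index" entity, i.e. an action line followed by the document source line.
  // Assumption: id contains no characters that need JSON escaping.
  public static String toIndexEntity(String id, String jsonDoc) {
    return "{\"index\":{\"_id\":\"" + id + "\"}}\n" + jsonDoc + "\n";
  }

  // Role 2 (batching only): concatenate already-serialized entities into
  // payloads holding at most maxBatchSize entities each. Each payload would
  // be the body of one Bulk API request.
  public static List<String> toBulkPayloads(List<String> entities, int maxBatchSize) {
    if (maxBatchSize < 1) {
      throw new IllegalArgumentException("maxBatchSize must be >= 1");
    }
    List<String> payloads = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int count = 0;
    for (String entity : entities) {
      current.append(entity);
      count++;
      if (count == maxBatchSize) {
        // Batch is full: emit it and start a new one.
        payloads.add(current.toString());
        current.setLength(0);
        count = 0;
      }
    }
    if (count > 0) {
      // Emit the final, partially filled batch.
      payloads.add(current.toString());
    }
    return payloads;
  }

  public static void main(String[] args) {
    List<String> entities = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      entities.add(toIndexEntity(String.valueOf(i), "{\"n\":" + i + "}"));
    }
    List<String> payloads = toBulkPayloads(entities, 2);
    // prints "3 bulk payloads for 5 entities"
    System.out.println(payloads.size() + " bulk payloads for " + entities.size() + " entities");
  }
}
```

With the steps split this way, fewer and fuller payloads mean fewer indexing requests for the same number of documents, which is the effect the description attributes to honoring `maxBatchSize` more closely.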



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
