[
https://issues.apache.org/jira/browse/BEAM-12093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Evan Galpin updated BEAM-12093:
-------------------------------
Fix Version/s: 2.31.0
> Overhaul ElasticsearchIO#Write
> ------------------------------
>
> Key: BEAM-12093
> URL: https://issues.apache.org/jira/browse/BEAM-12093
> Project: Beam
> Issue Type: Improvement
> Components: io-java-elasticsearch
> Reporter: Evan Galpin
> Assignee: Evan Galpin
> Priority: P2
> Labels: elasticsearch
> Fix For: 2.31.0
>
> Time Spent: 16h 20m
> Remaining Estimate: 0h
>
> The current ElasticsearchIO#Write is great, but there are two related areas
> which could be improved:
> # Separation of concerns
> # Bulk API batch size optimization
>
> Presently, the Write transform has 2 responsibilities which are coupled and
> inseparable by users:
> # Convert input documents into Bulk API entities, serializing based on user
> settings (partial update, delete, upsert, etc.)
> # Batch the converted Bulk API entities together and interface with the
> target ES cluster
>
> Having these 2 roles tightly coupled means testing requires an available
> Elasticsearch cluster, making unit testing almost impossible. Allowing access
> to the serialized documents would make unit testing much easier for pipeline
> developers, among numerous other benefits to having separation between
> serialization and IO.
> Relatedly, the batching of entities when creating Bulk API payloads is
> currently limited by the lesser of the bundle size determined by the Beam
> runner and the `ElasticsearchIO#Write#maxBatchSize` setting. This is
> understandable for portability between runners, but it also means most Bulk
> payloads contain only a few (1-5) entities. By using Stateful Processing to
> better adhere to the `ElasticsearchIO#Write#maxBatchSize` setting, we have
> been able to reduce the number of indexing requests sent to an Elasticsearch
> cluster by 50-100x.
> Separating the roles of document serialization and IO allows supporting
> multiple IO techniques with minimal and understandable code.
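
To illustrate the separation described above, here is a minimal sketch in plain Java with no Beam or Elasticsearch dependency. It is not the ElasticsearchIO implementation: the class `BulkWriteSketch` and the methods `toBulkEntity` and `toBulkPayloads` are hypothetical names chosen for this example, it only covers the `index` action, and it assumes the index name and document id need no JSON escaping. The batching step is a simple in-memory loop standing in for the Stateful Processing approach mentioned in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and structure are assumptions, not Beam's API.
final class BulkWriteSketch {

    // Step 1 (pure function, no cluster required): serialize one JSON document
    // into a Bulk API "index" entity, i.e. an action line followed by the
    // document source line, each terminated by a newline.
    // Assumes index and id contain no characters that need JSON escaping.
    public static String toBulkEntity(String index, String id, String jsonDocument) {
        return "{\"index\":{\"_index\":\"" + index + "\",\"_id\":\"" + id + "\"}}\n"
            + jsonDocument + "\n";
    }

    // Step 2: group already-serialized entities into Bulk payloads that each
    // hold at most maxBatchSize entities. Sending the payloads to a cluster
    // would be a separate concern and is intentionally left out.
    public static List<String> toBulkPayloads(List<String> entities, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        List<String> payloads = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String entity : entities) {
            current.append(entity);
            count++;
            if (count == maxBatchSize) {
                // Batch is full: emit it and start a new one.
                payloads.add(current.toString());
                current.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            // Emit the final, partially filled batch.
            payloads.add(current.toString());
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<String> entities = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            entities.add(toBulkEntity("my-index", "doc-" + i, "{\"n\":" + i + "}"));
        }
        // Five entities with a batch size of two yield three payloads.
        List<String> payloads = toBulkPayloads(entities, 2);
        System.out.println(payloads.size() + " payloads");
    }
}
```

Because `toBulkEntity` is a pure function, its output can be asserted in a unit test without a running cluster, which is the testing benefit the description argues for. Keeping `toBulkPayloads` separate means a different batching strategy could be substituted without touching serialization.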