Evan Galpin created BEAM-12093:
----------------------------------

             Summary: Overhaul ElasticsearchIO#Write
                 Key: BEAM-12093
                 URL: https://issues.apache.org/jira/browse/BEAM-12093
             Project: Beam
          Issue Type: Improvement
          Components: io-java-elasticsearch
            Reporter: Evan Galpin


The current ElasticsearchIO#Write is great, but there are two related areas 
which could be improved:
 # Separation of concern
 # Bulk API batch size optimization

 

Presently, the Write transform has 2 responsibilities which are coupled and 
inseparable by users:
 # Convert input documents into Bulk API entities, serializing based on user 
settings (partial update, delete, upsert, etc)
 # Batch the converted Bulk API entities together and interface with the target 
ES cluster

 

Having these 2 roles tightly coupled means testing requires an available 
Elasticsearch cluster, making unit testing almost impossible. Allowing access 
to the serialized documents would make unit testing much easier for pipeline 
developers, among numerous other benefits to having separation between 
serialization and IO.

Relatedly, the batching of entities when creating Bulk API payloads is 
currently limited by the lesser of Beam Runner bundling semantics, and the 
`ElasticsearchIO#Write#maxBatchSize` setting. This is understandable for 
portability between runners, but it also means most Bulk payloads only have a 
few (1-5) entities. By using Stateful Processing to better adhere to the 
`ElasticsearchIO#Write#maxBatchSize` setting, we have been able to drop the 
number of indexing requests in an Elasticsearch cluster by 50-100x. Separating 
the role of document serialization and IO allows supporting multiple IO 
techniques with minimal and understandable code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to