egalpin commented on issue #22840: URL: https://github.com/apache/beam/issues/22840#issuecomment-1225707383
@sheepdreamofandroids Something that exists today and might work for you is
to use the building blocks of `ElasticsearchIO#Write` separately:
`DocToBulk`[1] and `BulkIO`[2]. `DocToBulk` takes JSON-serialized inputs and
converts them to a representation that the ES Bulk API can work with.
`BulkIO` deals strictly with batching and sending data to an ES cluster.
In your use case, it sounds like you have a single input PCollection whose
elements need to fan out and be processed in multiple ways for inclusion in
different ES indices. You could fan out to a separate `DocToBulk` for each
kind of processing, then flatten the results and use a single `BulkIO`
operation. This would allow for larger Bulk API payloads (larger batches),
because the outputs of all the `DocToBulk` transforms could be combined into
a single `BulkIO` write to ES (depending on buffering time, of course).
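To make the batching argument concrete, here is a deliberately simplified model in plain Java (no Beam involved). The class name, method names, document counts, and batch size are all made up for illustration, and the model ignores bundle boundaries and buffering time, which affect real batch sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: compares bulk request sizes with one sink per branch
// versus a single shared sink. All names and numbers are hypothetical.
class BulkBatchModel {

  // Splits docCount documents into requests of at most maxBatchSize each.
  static List<Integer> requestSizes(int docCount, int maxBatchSize) {
    List<Integer> sizes = new ArrayList<>();
    for (int remaining = docCount; remaining > 0; remaining -= maxBatchSize) {
      sizes.add(Math.min(remaining, maxBatchSize));
    }
    return sizes;
  }

  // One sink per branch: each branch batches only its own documents.
  static List<Integer> separateSinks(int[] docsPerBranch, int maxBatchSize) {
    List<Integer> sizes = new ArrayList<>();
    for (int docs : docsPerBranch) {
      sizes.addAll(requestSizes(docs, maxBatchSize));
    }
    return sizes;
  }

  // Shared sink: documents from all branches are batched together.
  static List<Integer> sharedSink(int[] docsPerBranch, int maxBatchSize) {
    int total = 0;
    for (int docs : docsPerBranch) {
      total += docs;
    }
    return requestSizes(total, maxBatchSize);
  }

  public static void main(String[] args) {
    int[] docsPerBranch = {300, 120, 80};
    System.out.println(separateSinks(docsPerBranch, 1000)); // [300, 120, 80]
    System.out.println(sharedSink(docsPerBranch, 1000));    // [500]
  }
}
```

With separate sinks the three branches issue three small requests; with a shared sink the same 500 documents fit in one request.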
```
                     ┌───────────────────────┐
                     │   Input PCollection   │
                     └───────────┬───────────┘
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
┌────────▼────────┐     ┌────────▼────────┐     ┌────────▼────────┐
│   DocToBulk1    │     │   DocToBulk2    │     │   DocToBulk_n   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │           ┌───────────▼───────────┐           │
         └──────────►│        Flatten        │◄──────────┘
                     └───────────┬───────────┘
                                 │
                     ┌───────────▼───────────┐
                     │        BulkIO         │
                     └───────────────────────┘
```
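A rough Java sketch of the wiring in the diagram might look like the following. It assumes the Beam Java SDK and the Elasticsearch IO module are on the classpath; the cluster address, index names, sample input, and batch settings are placeholders I made up, and the method signatures should be checked against the Javadoc for your Beam version.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.ConnectionConfiguration;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.Document;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.joda.time.Duration;

public class FanOutToSingleBulkIO {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Assumption: placeholder cluster address and default index name.
    ConnectionConfiguration conn =
        ConnectionConfiguration.create(
            new String[] {"http://localhost:9200"}, "default-index");

    // Assumption: a tiny in-memory input standing in for the real
    // PCollection of JSON-serialized documents.
    PCollection<String> input =
        pipeline.apply("Input", Create.of("{\"id\":\"1\"}", "{\"id\":\"2\"}"));

    // Fan out: each branch converts the input to Bulk API entries
    // targeting its own (hypothetical) index.
    PCollection<Document> forIndexA =
        input.apply(
            "DocToBulk1",
            ElasticsearchIO.docToBulk()
                .withConnectionConfiguration(conn)
                .withIndexFn(doc -> "index-a"));

    PCollection<Document> forIndexB =
        input.apply(
            "DocToBulk2",
            ElasticsearchIO.docToBulk()
                .withConnectionConfiguration(conn)
                .withIndexFn(doc -> "index-b"));

    // Flatten all branches and send them through a single BulkIO,
    // so entries from every branch can share the same bulk requests.
    PCollectionList.of(forIndexA)
        .and(forIndexB)
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "BulkIO",
            ElasticsearchIO.bulkIO()
                .withConnectionConfiguration(conn)
                .withMaxBatchSize(1000L)
                .withMaxBufferingDuration(Duration.standardSeconds(5)));

    pipeline.run().waitUntilFinish();
  }
}
```

Any per-branch processing (filtering, reshaping the JSON) would go between `input` and each `DocToBulk` step; it is omitted here to keep the sketch short.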
[1] https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.DocToBulk.html
[2] https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.BulkIO.html
