Hi folks,

I was talking with a colleague about a scenario he was facing and we were 
exploring the idea of Beam as a possible solution. I thought I’d reach out to 
the audience here to get their opinions.

Basically, we have a series of single tenant Elasticsearch indices that we are 
attempting to read from, transform, and ultimately send to a Kafka topic to be 
consumed by some downstream multi-tenant Beam pipeline.

The current working thoughts are something to the effect of:
- Read all of the locations of the various tenant indices from a centralized 
location.
- Construct the appropriate transforms (perhaps just via a 
`configurations.forEach()` loop or some other pattern)
- Apply the transforms against the incoming data from Elastic (the data should 
be uniformly constrained in terms of schema)
- Send to a Kafka topic with the tenant identifier attached as part of the 
transform process (rough sketch below)
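
Just to make that concrete, here is a very rough Java sketch of the shape I had 
in mind, using Beam's ElasticsearchIO and KafkaIO connectors. TenantConfig, 
loadConfigs(), and the broker/topic names are placeholders I made up, and I 
haven't verified the exact ConnectionConfiguration signature against a specific 
Beam version:

```java
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringSerializer;

public class TenantIngestPipeline {

  // Hypothetical per-tenant configuration, read from some centralized location.
  static class TenantConfig {
    String tenantId;
    String[] esAddresses; // e.g. {"http://tenant-a-es:9200"}
    String indexName;
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    List<TenantConfig> configurations = loadConfigs();

    // Build one read -> key-by-tenant -> write branch per tenant at pipeline construction time.
    configurations.forEach(cfg -> {
      final String tenant = cfg.tenantId; // capture only the String so the lambda stays serializable
      pipeline
          .apply("Read-" + tenant,
              ElasticsearchIO.read().withConnectionConfiguration(
                  ElasticsearchIO.ConnectionConfiguration.create(
                      cfg.esAddresses, cfg.indexName)))
          // ElasticsearchIO.read() yields JSON documents as Strings; tag each one
          // with its tenant id so the downstream multi-tenant pipeline can tell them apart.
          .apply("KeyByTenant-" + tenant,
              MapElements.into(
                      TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                  .via((String json) -> KV.of(tenant, json)))
          .apply("WriteToKafka-" + tenant,
              KafkaIO.<String, String>write()
                  .withBootstrapServers("broker:9092") // placeholder
                  .withTopic("tenant-events")          // placeholder
                  .withKeySerializer(StringSerializer.class)
                  .withValueSerializer(StringSerializer.class));
    });

    pipeline.run();
  }

  static List<TenantConfig> loadConfigs() {
    // Placeholder: fetch the tenant index locations from the centralized store.
    return Collections.emptyList();
  }
}
```

The idea is that the per-tenant branches get built once from the centralized 
configuration when the pipeline is constructed, and everything downstream just 
sees a single Kafka topic keyed by tenant.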

Does this seem like something Beam would be suitable for? These indices could 
be quite large and updated frequently (depending on the tenant), so I don't 
know whether watermarking should be a concern: specifically, per-tenant 
watermarking to avoid ingesting the same data twice, and the ability to resume 
without data loss if the services go down. I'm not aware of Beam or the Elastic 
connector having this notion, but I haven't explored it in depth.

Additionally, there are scaling concerns (e.g. if one particular tenant has a 
large volume and others have very little, are there mechanisms for handling 
that?). What if there were thousands of tenants? Could a single pipeline 
effectively handle that kind of volume?

Any thoughts or advice would be appreciated. I’d love to have the confidence to 
use Beam for this, but if it’s not the right tool I don’t want to fight it more 
than necessary.

Thanks,

Rion
