lostluck commented on issue #25212:
URL: https://github.com/apache/beam/issues/25212#issuecomment-1523668920

   Hi @shhivam ! Great question! I've been meaning to write a more direct 
tutorial about this, but lately I've been focused on the Go Direct Runner 
replacement, 
[Prism](https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism), 
and on helping get Timers into the SDK. Here's what I think. Please do feel free 
to ask questions, and do @ mention me to get my attention. I can't promise I'll 
reply fast, but I do try to help others out.
   
   There's one way to create an unbounded source: a [Splittable 
DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) 
that also returns [process 
continuations](https://beam.apache.org/documentation/programming-guide/#user-initiated-checkpoint) 
and, critically, provides [watermark 
estimates](https://beam.apache.org/documentation/programming-guide/#watermark-estimation).
   
   Alternatively, the 2.47.0 release will include a 
[`periodic.Impulse`](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/transforms/periodic/periodic.go#L118) 
transform which periodically "impulses" downstream transforms, making 
them unbounded. It can be thought of as a streaming replacement for 
`beam.Impulse`. 
   It also serves as a good example of how to write an unbounded source. Note 
that while we use the term "unbounded", such sources can in fact terminate when 
there's no more work, which allows a streaming pipeline to gracefully terminate 
if desired.
   
   For any kind of "database" model, I'd recommend starting with a batch source 
whose input elements are configuration objects for the datastore.
   
   For a pubsub approach, it is a little trickier to manage splitting and 
scaling horizontally, and the appropriate ways to scale up and down depend on 
the underlying message broker's model. But as a first approximation, using 
periodic impulse to "poll" whatever subscription is created for the 
job seems reasonable. You'll also want to use [Bundle 
Finalization](https://beam.apache.org/documentation/programming-guide/#bundle-finalization) 
to perform whatever acknowledgement the broker requires. 
   
   It looks like [Redis uses at-most-once 
semantics](https://redis.io/docs/manual/pubsub/#delivery-semantics) for 
Pub/Sub (unless it's using [Redis 
streams](https://redis.io/docs/data-types/streams-tutorial/)), so if you want to 
add guarantees to the Redis PubSub source, you'll need to add a 
`beam.Reshuffle` or a GBK to provide some sort of checkpointing on that data, 
to avoid data loss.
   
   The last thing I'll say is that if you want a streaming job to drain 
quickly, you'll also want to add support for [Truncating on 
Drain](https://beam.apache.org/documentation/programming-guide/#truncating-during-drain), 
which allows any restrictions to be "reduced" to some smaller size if necessary. 
Otherwise, bounded restrictions will be executed to completion to avoid 
data loss. This only applies to runners that support drains, which I think is 
only Dataflow, but I do intend to support it in Prism eventually, because 
something open source should be able to test that behavior.

