lostluck commented on issue #25212:
URL: https://github.com/apache/beam/issues/25212#issuecomment-1523668920

   Hi @shhivam ! Great question! I've been meaning to write a more direct 
tutorial about this, but lately I've been focused on the Go Direct Runner 
replacement, 
[Prism](https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism), 
and on helping get Timers into the SDK. Here's what I think. Please do feel free 
to ask questions, and do @ mention me to get my attention. I can't promise I'll 
reply fast, but I do try to help others out.
   
   There's one way to create an unbounded source: a [Splittable 
DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) 
that also returns [process 
continuations](https://beam.apache.org/documentation/programming-guide/#user-initiated-checkpoint) 
and, critically, provides [watermark 
estimates](https://beam.apache.org/documentation/programming-guide/#watermark-estimation).
   
   Alternatively, the 2.47.0 release will include a 
[`periodic.Impulse`](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/transforms/periodic/periodic.go#L118) 
transform which periodically "impulses" downstream transforms, making 
them unbounded. It can be thought of as a streaming replacement for 
`beam.Impulse`. 
   It also serves as a good example of how to write an unbounded source. Note 
that while we use the term "unbounded", such sources can in fact terminate when 
there's no more work, which allows a streaming pipeline to gracefully terminate 
if desired.
   
   For any kind of "database" model, I'd recommend starting with a batch source 
whose input elements are configuration objects for the datastore.
   
   For a pubsub approach, it is a little trickier to manage splitting and 
scaling horizontally, and the appropriate ways to scale up and down depend on 
the underlying message broker's model. But as a first approximation, using 
periodic impulse to "poll" whatever subscription is created for the 
job seems reasonable. You'll also want to use [Bundle 
Finalization](https://beam.apache.org/documentation/programming-guide/#bundle-finalization) 
to perform whatever acknowledgement the broker requires. 
   
   It looks like [Redis uses at-most-once 
semantics](https://redis.io/docs/manual/pubsub/#delivery-semantics) for 
Pub/Sub (unless it's using [Redis 
streams](https://redis.io/docs/data-types/streams-tutorial/)), so if you want to 
add guarantees to the Redis PubSub source, you'll need to add a 
`beam.Reshuffle` or a GBK to provide some sort of checkpointing on that data, 
to avoid data loss.
   
   The last thing I'll say is that if you want a streaming job to drain 
quickly, you'll also want to add support for [Truncating on 
Drain](https://beam.apache.org/documentation/programming-guide/#truncating-during-drain), 
which allows any restrictions to be "reduced" to some smaller size if necessary. 
Otherwise, bounded restrictions will be executed to completion to avoid 
data loss. This only applies to runners that support drains, which I think is 
only Dataflow, but I do intend to support it in Prism eventually, because 
something open source should be able to test that behavior.

