rmetzger commented on issue #6594: [FLINK-9311] [pubsub] Added PubSub source connector with support for checkpointing (ATLEAST_ONCE) URL: https://github.com/apache/flink/pull/6594#issuecomment-468698788 I've tested the PR a bit again, and I talked with @StephanEwen about the `MultipleIdsMessageAcknowledgingSourceBase`. I have the following observations to share: 1. Stephan is proposing to follow Apache Beam in the implementation of the PubSubSource, because the `MultipleIdsMessageAcknowledgingSourceBase` only provides exactly-once with a parallelism = 1. Apache Beam has a source for PubSub that only guarantees message delivery (aka at least once). After the sources, they have a keyed window operator, which deduplicates messages based on a unique id within a certain time-window (say 10 minutes), making this exactly-once within that time-window. 2. The PubSub source's performance is limited by the checkpointing frequency. When running the source with a checkpoint once every 1000ms, I achieve a throughput of 3000 msg/sec. With a checkpoint frequency of 50ms, I achieve a throughput of 20.000 msg/s. The other parameter to tweak is the number of source instances. With 80 sources and a checkpoint interval of 1000ms, I achieve 65.000 msg/s. 3. The PubSub grpc system seems to be quite CPU intensive. The PubSub Sink can very easily burn my CPU. At a throughput of 15k msg/s, my CPU is already at 600%. I could not find anything in the code that would explain this behavior. The CPU sampler indicated that the grpc/netty code of the PubSub client is eating up many CPU cycles. What’s your take on these observations? I’m thinking the following for them: 1. We should investigate if this is easily possible to implement. If so, we could provide a fairly “bare” PubSubSource, which exposes the message ID downstream, so that we can provide a generic deduplication window (for this and other connectors). If we go that route, I would propose to make this PR only about the Source itself, and provide the deduplication logic in a follow up PR. 2. Requiring users to run large topologies with very high checkpointing frequencies (or a high number of pubsub sources is not a good user experience. We should consider making the ack-behavior configurable. We either ack on checkpoint completion (current behavior, and my proposed default behavior), or we ack immediately. Acking immediately will basically make the source potentially loose records on failures, but it should be very fast. 3. I don’t think we can do anything about that.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
