rmetzger commented on issue #6594: [FLINK-9311] [pubsub] Added PubSub source 
connector with support for checkpointing (ATLEAST_ONCE)
URL: https://github.com/apache/flink/pull/6594#issuecomment-468698788
 
 
   I've tested the PR a bit again, and I talked with @StephanEwen about the 
`MultipleIdsMessageAcknowledgingSourceBase`. 
   I have the following observations to share:
   1. Stephan is proposing to follow Apache Beam in the implementation of the 
PubSubSource, because the `MultipleIdsMessageAcknowledgingSourceBase` only 
provides exactly-once with a parallelism = 1. Apache Beam has a source for 
PubSub that only guarantees message delivery (aka at least once). After the 
sources, they have a keyed window operator, which deduplicates messages based 
on a unique id within a certain time-window (say 10 minutes), making this 
exactly-once within that time-window.
   2. The PubSub source's performance is limited by the checkpointing 
frequency. When running the source with a checkpoint once every 1000ms, I 
achieve a throughput of 3000 msg/sec. With a checkpoint frequency of 50ms, I 
achieve a throughput of 20.000 msg/s.
   The other parameter to tweak is the number of source instances. With 80 
sources and a checkpoint interval of 1000ms, I achieve 65.000 msg/s.
   3. The PubSub grpc system seems to be quite CPU intensive. The PubSub Sink 
can very easily burn my CPU. At a throughput of 15k msg/s, my CPU is already at 
600%. I could not find anything in the code that would explain this behavior. 
The CPU sampler indicated that the grpc/netty code of the PubSub client is 
eating up many CPU cycles.
   
   What’s your take on these observations?
   
   I’m thinking the following for them:
   1. We should investigate if this is easily possible to implement. If so, we 
could provide a fairly “bare” PubSubSource, which exposes the message ID 
downstream, so that we can provide a generic deduplication window (for this and 
other connectors).
If we go that route, I would propose to make this PR only 
about the Source itself, and provide the deduplication logic in a follow up PR.
   2. Requiring users to run large topologies with very high checkpointing 
frequencies (or a high number of pubsub sources is not a good user experience. 
We should consider making the ack-behavior configurable. We either ack on 
checkpoint completion (current behavior, and my proposed default behavior), or 
we ack immediately. Acking immediately will basically make the source 
potentially loose records on failures, but it should be very fast.
   3. I don’t think we can do anything about that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to