[
https://issues.apache.org/jira/browse/FLINK-20625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jakob Edding updated FLINK-20625:
---------------------------------
Description:
The Source implementation for Google Cloud Pub/Sub should be refactored in
accordance with [FLIP-27: Refactor Source
Interface|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95653748].
*Split Enumerator*
Pub/Sub doesn't expose any partitions to consuming applications. Therefore, the
implementation of the Pub/Sub Split Enumerator won't do any real work
discovery. Instead, a static Source Split is handed to Source Readers which
request a Source Split. This static Source Split merely contains details about
the connection to Pub/Sub and the concrete Pub/Sub subscription to use but no
Split-specific information like partitions/offsets because this information can
not be obtained.
*Source Reader*
A Source Reader will use Pub/Sub's pull mechanism to read new messages from the
Pub/Sub subscription specified in the SourceSplit. In the case of
parallel-running Source Readers in Flink, every Source Reader will be passed
the same Source Split from the Enumerator. Because of this, all Source Readers
use the same connection details and the same Pub/Sub subscription to receive
messages. In this case, Pub/Sub will automatically load-balance messages
between all Source Readers pulling from the same subscription. This way,
parallel processing can be achieved in the Source.
*At-least-once guarantee*
Pub/Sub itself guarantees at-least-once message delivery so it is the goal to
keep up this guarantee in the Source as well. A mechanism that can be used to
achieve this is that Pub/Sub expects a message to be acknowledged by the
subscriber to signal that the message has been consumed successfully. Any
message that has not been acknowledged yet will be automatically redelivered by
Pub/Sub once an ack deadline has passed.
After a certain time interval has elapsed...
# all pulled messages are checkpointed in the Source Reader
# same messages are acknowledged to Pub/Sub
# same messages are forwarded to downstream Flink tasks
This should ensure at-least-once delivery in the Source because in the case of
failure, non-checkpointed messages have not yet been acknowledged and will
therefore be redelivered.
Because of the static Source Split, it appears like checkpointing is not
necessary in the Split Enumerator.
was:
The Source implementation for Google Cloud Pub/Sub should be refactored in
accordance with [FLIP-27: Refactor Source
Interface|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95653748].
> Refactor Google Cloud PubSub Source in accordance with FLIP-27
> --------------------------------------------------------------
>
> Key: FLINK-20625
> URL: https://issues.apache.org/jira/browse/FLINK-20625
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Google Cloud PubSub
> Reporter: Jakob Edding
> Priority: Major
>
> The Source implementation for Google Cloud Pub/Sub should be refactored in
> accordance with [FLIP-27: Refactor Source
> Interface|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95653748].
> *Split Enumerator*
> Pub/Sub doesn't expose any partitions to consuming applications. Therefore,
> the implementation of the Pub/Sub Split Enumerator won't do any real work
> discovery. Instead, a static Source Split is handed to Source Readers which
> request a Source Split. This static Source Split merely contains details
> about the connection to Pub/Sub and the concrete Pub/Sub subscription to use
> but no Split-specific information like partitions/offsets because this
> information can not be obtained.
> *Source Reader*
> A Source Reader will use Pub/Sub's pull mechanism to read new messages from
> the Pub/Sub subscription specified in the SourceSplit. In the case of
> parallel-running Source Readers in Flink, every Source Reader will be passed
> the same Source Split from the Enumerator. Because of this, all Source
> Readers use the same connection details and the same Pub/Sub subscription to
> receive messages. In this case, Pub/Sub will automatically load-balance
> messages between all Source Readers pulling from the same subscription. This
> way, parallel processing can be achieved in the Source.
> *At-least-once guarantee*
> Pub/Sub itself guarantees at-least-once message delivery so it is the goal to
> keep up this guarantee in the Source as well. A mechanism that can be used to
> achieve this is that Pub/Sub expects a message to be acknowledged by the
> subscriber to signal that the message has been consumed successfully. Any
> message that has not been acknowledged yet will be automatically redelivered
> by Pub/Sub once an ack deadline has passed.
> After a certain time interval has elapsed...
> # all pulled messages are checkpointed in the Source Reader
> # same messages are acknowledged to Pub/Sub
> # same messages are forwarded to downstream Flink tasks
> This should ensure at-least-once delivery in the Source because in the case
> of failure, non-checkpointed messages have not yet been acknowledged and will
> therefore be redelivered.
> Because of the static Source Split, it appears like checkpointing is not
> necessary in the Split Enumerator.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)