[
https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050031#comment-17050031
]
Yuan Mei edited comment on FLINK-15670 at 3/4/20, 9:11 AM:
-----------------------------------------------------------
[~sewen]
We need to chat a bit about two things:
# Redefine the scope of the problem, at least for 1.11;
# Handle watermarks when multiple subtasks write to the same partition
** This is a common problem for intermediate persistence, not just for Kafka
** The current mechanism relies on the downstream `ExecutionVertex` to advance
the watermark. However, in the case of a sink, there is no downstream
operator.
** I was thinking that if there were a coordinator for all subtasks of an
ExecutionJobVertex, the watermark progress logic could be handled in the
coordinator
** I found an interface, `OperatorCoordinator`, that may be usable in this
case, but its only two usages are under `test`
*A bit more detail for reference:*
- Downstream, watermarks are handled in `StreamTaskNetworkInput.processElement`
-> `StatusWatermarkValve.inputWatermark`
In this path, the watermark of each channel is kept separately and aligned
once it reaches the downstream task.
- Upstream, data is written and buffered through `ChannelSelectorRecordWriter`,
which maintains a BufferBuilder for each subpartition (channel).
See `RecordWriterOutput.collect` and `RecordWriterOutput.emitWatermark`
for how records and watermarks are emitted, respectively.
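To make the alignment idea concrete, here is a minimal sketch in plain Java of what "keep one watermark per channel and align" amounts to: remember the latest watermark per input and advance the combined watermark only when the minimum over all inputs moves forward. `MinWatermarkTracker` and its methods are made-up names for illustration, not Flink classes, and this is not the actual `StatusWatermarkValve` code (it ignores idle channels, for example). A coordinator for the sink subtasks could presumably apply the same rule with one slot per subtask instead of one per channel.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Flink): combines per-subtask watermarks by taking their minimum. */
public class MinWatermarkTracker {
    // Latest watermark reported by each subtask (or input channel).
    private final long[] watermarks;
    // Last combined watermark; starts at the lowest possible value.
    private long combined = Long.MIN_VALUE;

    public MinWatermarkTracker(int numSubtasks) {
        watermarks = new long[numSubtasks];
        Arrays.fill(watermarks, Long.MIN_VALUE);
    }

    /**
     * Records a watermark from one subtask.
     *
     * @return true if the combined (minimum) watermark advanced
     */
    public boolean report(int subtask, long watermark) {
        // A watermark never moves backwards, so ignore older values.
        if (watermark > watermarks[subtask]) {
            watermarks[subtask] = watermark;
        }
        // The combined watermark is the minimum over all subtasks.
        long min = Long.MAX_VALUE;
        for (long w : watermarks) {
            min = Math.min(min, w);
        }
        if (min > combined) {
            combined = min;
            return true;
        }
        return false;
    }

    /** Returns the current combined watermark. */
    public long current() {
        return combined;
    }
}
```

For example, with two subtasks, a report of 10 from subtask 0 followed by 5 from subtask 1 gives a combined watermark of 5; once subtask 1 reports 20, the combined watermark moves to 10.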
was (Author: ym):
[~sewen]
We need to chat a bit about two things:
# Redefine the scope of the problem, at least for 1.11;
# Watermark handling when multiple subtasks write to the same partition
** This is a common problem for intermediate persistence, not just for Kafka
** The current mechanism relies on the downstream `ExecutionVertex` to advance
the watermark. However, in the case of a sink, there is no downstream
operator.
** I was thinking that if there were a coordinator for all subtasks of an
ExecutionJobVertex, the watermark progress logic could be handled in the
coordinator
** I found an interface, `OperatorCoordinator`, that may be usable in this
case, but its only two usages are under `test`
*A bit more detail for reference:*
- Downstream, watermarks are handled in
`StreamTaskNetworkInput.processElement` ->
`StatusWatermarkValve.inputWatermark`
In this path, the watermark of each channel is kept separately and aligned
once it reaches the downstream task.
- Upstream, data is written and buffered through `ChannelSelectorRecordWriter`,
which maintains a BufferBuilder for each subpartition (channel).
See `RecordWriterOutput.collect` and `RecordWriterOutput.emitWatermark`
for how records and watermarks are emitted, respectively.
> Provide a Kafka Source/Sink pair that aligns Kafka's Partitions and Flink's
> KeyGroups
> -------------------------------------------------------------------------------------
>
> Key: FLINK-15670
> URL: https://issues.apache.org/jira/browse/FLINK-15670
> Project: Flink
> Issue Type: New Feature
> Components: API / DataStream, Connectors / Kafka
> Reporter: Stephan Ewen
> Priority: Major
> Labels: usability
> Fix For: 1.11.0
>
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them
> without partitioning them again (avoid shuffles)
> 2. You can use this to shuffle through Kafka, thereby decomposing the job
> into smaller jobs and independent pipelined regions that fail over
> independently.