AHeise edited a comment on issue #11725:
URL: https://github.com/apache/flink/pull/11725#issuecomment-617104290


   >     1. If I am understanding correctly (please ignore if I am not), the original idea is to make use of data that is already partitioned, so that users do not need an extra "reinterpret".
   > 
   > 
   > The problem is that a KeyedStream needs a specific way to map a key to a partition ID, namely `KeyGroupRangeAssignment.assignKeyToParallelOperator`. If the keyed data in Kafka does not follow this hash function, it has to be repartitioned (keyBy).
   > 
   > In other words, there is no easy way to guarantee that an external system partitions data the same way Flink does. I think that's why Stephan suggests in the Jira to also provide a Kafka writer; that way, how the data is partitioned is controlled internally by us.
   > 
   > `reinterpretAsKeyedStream` itself does not need an extra shuffle (if that's the concern).
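   
   For context, the mapping being referred to looks roughly like this. A minimal sketch using Flink's `KeyGroupRangeAssignment` (the parallelism values and key are made up for illustration):
   
```java
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyToSubtaskSketch {

    public static void main(String[] args) {
        // Hypothetical job settings.
        int maxParallelism = 128;
        int parallelism = 4;

        // Flink hashes the key into a key group and then maps that key group
        // onto one of the parallel operator instances.
        int subtaskIndex =
                KeyGroupRangeAssignment.assignKeyToParallelOperator(
                        "user-42", maxParallelism, parallelism);

        // For reinterpretAsKeyedStream to be safe on pre-partitioned input,
        // the external system's partitioner must reproduce exactly this
        // mapping; anything else requires a keyBy.
        System.out.println("Key is assigned to subtask " + subtaskIndex);
    }
}
```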
   
   First off, I haven't fully considered using Kafka as an external shuffle service. I know of plans to have persistent communication channels, which have a high conceptual overlap with `KafkaShuffle`. I fear that once these persistent channels arrive, users may be confused about when to use which.
   
   Then, my second thought on `KafkaShuffle`: why not provide it directly as a `ShuffleServiceFactory`? Do we want a more selective approach to the shuffle service, or is the current interface not suitable (e.g., how would the topic name be determined)?
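   
   To make that question concrete, this is the shape I have in mind. A sketch only: the class name is invented here, and the signatures follow my reading of the FLIP-31 interfaces, so they may differ slightly between versions:
   
```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.io.network.api.writer.ResultPartitionWriter;
import org.apache.flink.runtime.io.network.partition.consumer.IndexedInputGate;
import org.apache.flink.runtime.shuffle.ShuffleDescriptor;
import org.apache.flink.runtime.shuffle.ShuffleEnvironment;
import org.apache.flink.runtime.shuffle.ShuffleEnvironmentContext;
import org.apache.flink.runtime.shuffle.ShuffleMaster;
import org.apache.flink.runtime.shuffle.ShuffleServiceFactory;

/** Hypothetical factory that would back Flink's shuffle with Kafka topics. */
public class KafkaShuffleServiceFactory
        implements ShuffleServiceFactory<
                ShuffleDescriptor, ResultPartitionWriter, IndexedInputGate> {

    @Override
    public ShuffleMaster<ShuffleDescriptor> createShuffleMaster(Configuration configuration) {
        // This is where the open question would live: the master would have
        // to derive a topic name per result partition.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public ShuffleEnvironment<ResultPartitionWriter, IndexedInputGate> createShuffleEnvironment(
            ShuffleEnvironmentContext shuffleEnvironmentContext) {
        // Writers would be Kafka producers; input gates, Kafka consumers.
        throw new UnsupportedOperationException("sketch only");
    }
}
```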
   
   So while I'm still not convinced of the general direction (and I originally had something else in mind), I'm assuming for now that this `KafkaShuffle` already provides some value to users. Then the topic should almost be treated as an implementation detail.
   
   >     2. How people can reuse the data
   >        We can provide a read API for people to read; that should not be difficult to do (without making them worry about watermarks), as they only need to provide a data schema.
   
   On the premise of having a pure shuffle service, I agree that this is an implementation detail. I was wondering, though, whether users of external systems might also want to consume the data directly, since it would be hard to provide a general-purpose read API.
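   
   Just to pin down what I understand the proposed read API to be, a hypothetical facade; none of these names exist in the PR:
   
```java
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/** Hypothetical reader for data that KafkaShuffle has written. */
public final class KafkaShuffleReader {

    /**
     * The caller only supplies the internal topic and a schema; watermark
     * records would stay an internal concern and be stripped out before the
     * data reaches the user.
     */
    public static <T> DataStream<T> readKeyedStream(
            StreamExecutionEnvironment env, String topic, DeserializationSchema<T> schema) {
        throw new UnsupportedOperationException("illustrative sketch only");
    }
}
```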
   
   >     3. Why a watermark is needed
   >     4. Why the watermark is designed this way
   
   Agreed to both if this is an implementation detail.
   
   >     5. Why I have an extra sink function and operator class:
   >        To avoid affecting or changing the current interfaces. SinkFunctions and operators are broadly used, and I do not want to cause confusion or potential risks for our users.
   
   Usually, more code, especially duplicated code, leads to higher maintenance costs in the future. In particular, the operator is `@Internal`, so users should not use it directly and we do not give any guarantees on its stability. On the other hand, you modified the `@Public` `SinkFunction`, which needs to be done carefully (i.e., have someone from the SDK team look at it). I could see it being useful as a general extension independent of `KafkaShuffle`. Or we could add this method to a special `SinkFunctionWithWatermarks` or the like.
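   
   For that last option, a minimal sketch of what such an opt-in interface could look like; the name and method are placeholders, not an agreed design:
   
```java
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

/**
 * Hypothetical opt-in interface: watermark-aware sinks would implement this
 * instead of us changing the @Public SinkFunction itself.
 */
public interface SinkFunctionWithWatermarks<IN> extends SinkFunction<IN> {

    /** Called for each watermark that passes the sink operator. */
    void invokeWatermark(Watermark watermark) throws Exception;
}
```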


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

