We recently started a discussion on the Flink mailing list about this topic: [1]

The gist of it is that for some use cases it can be useful to track the
watermark per key instead of globally (or rather per partition). Think,
for example, of tracking some user data off mobile phones where the
user-id/phone number is the key. In that setup, one slow key, e.g. a
disconnected phone, would hold back the watermark for all other keys.
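
To make the difference concrete, here is a toy sketch in Python. It is not
Flink or Beam code, and all names in it (last_seen, global_watermark,
per_key_watermarks, can_fire) are made up for illustration. It also assumes
that the system somehow knows the latest event time per key, which is
exactly the part that is hard in practice.

```python
from typing import Dict


def global_watermark(last_event_time: Dict[str, int]) -> int:
    """Toy model of a single watermark: it is bounded by the slowest key."""
    return min(last_event_time.values())


def per_key_watermarks(last_event_time: Dict[str, int]) -> Dict[str, int]:
    """Toy model of per-key watermarks: each key advances independently."""
    return dict(last_event_time)


def can_fire(window_end: int, watermark: int) -> bool:
    """A window may fire once the watermark has reached its end."""
    return watermark >= window_end


# Hypothetical latest event times (ms) per phone; phone-c is disconnected.
last_seen = {"phone-a": 1000, "phone-b": 1005, "phone-c": 200}
window_end = 900

# With one watermark, the slow key blocks the window for every key.
fires_globally = can_fire(window_end, global_watermark(last_seen))

# With per-key watermarks, only the slow key's window is blocked.
fires_per_key = {
    key: can_fire(window_end, wm)
    for key, wm in per_key_watermarks(last_seen).items()
}
```

In this toy model fires_globally is False, while fires_per_key is True for
phone-a and phone-b and False only for phone-c.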

I'm not saying that this is a good idea. I'm actually skeptical, because the
implementation seems quite complicated/costly to me and I have some doubts
about being able to track the watermark per-key in the sources. It's just
that more people seem to be asking about this lately and I would like to
have a discussion with the clever Beam people because we claim to be at the
forefront of parallel data processing APIs. :-)

Also note that I'm aware that in the example given above it would be
difficult to find out if one key is slow in the first place and therefore
hold the watermark for that key.

If you're interested, please have a look at the discussion on the Flink ML
that I linked to above. It's only 4 mails so far and Jamie gives a nicer
explanation of the possible use cases than I'm doing here. Note that my
discussion of feasibility and APIs/key lineage should also apply to Beam.

What do you think?

[1]
https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
