This is a really interesting topic. I think Beam is a good place to have a
broader cross-runner discussion of this.

I know that you are not the only person skeptical due to
trickiness/costliness of implementation.

On the other hand, at a semantic level, I think any correct definition will
allow a per-PCollection watermark to serve as just a "very bad" per-key
watermark. So I would target a definition whereby any of these is
acceptable:

1. a runner with no awareness of per-key watermarks
2. a runner that chooses only sometimes to use them (there might be some
user-facing cost/latency tradeoff a la triggers here)
3. a runner that chooses a coarse approximation for tracking key lineage

Can you come up with an example where this is impossible?

Developing advanced model features whether or not all runners do (or can)
support them is exactly the spirit of Beam, to me, so I am really glad you
brought this up here.

On Mon, Feb 27, 2017 at 5:59 AM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> We recently started a discussion on the Flink ML about this topic: [1]
>
> The gist of it is that for some use cases tracking the watermark per-key
> instead of globally (or rather per partition) can be useful for some cases.
> Think, for example, of tracking some user data off mobile phones where the
> user-id/phone number is the key. For these cases, one slow key, i.e. a
> disconnected phone, would slow down the watermark.
>
> I'm not saying that this is a good idea, I'm actually skeptical because the
> implementation seems quite complicated/costly to me and I have some doubts
> about being able to track the watermark per-key in the sources. It's just
> that more people seem to be asking about this lately and I would like to
> have a discussion with the clever Beam people because we claim to be at the
> forefront of parallel data processing APIs. :-)
>
> Also note that I'm aware that in the example given above it would be
> difficult to find out if one key is slow in the first place and therefore
> hold the watermark for that key.
>
> If you're interested, please have a look at the discussion on the Flink ML
> that I linked to above. It's only 4 mails so far and Jamie gives a nicer
> explanation of the possible use cases than I'm doing here. Note, that my
> discussion of feasibility and APIs/key lineage should also apply to Beam.
>
> What do you think?
>
> [1]
> https://lists.apache.org/thread.html/2b90d5b1d5e2654212cfbbcc6510ef
> 424bbafc4fadb164bd5aff9216@%3Cdev.flink.apache.org%3E
>

Reply via email to