[ https://issues.apache.org/jira/browse/BEAM-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266262#comment-15266262 ]
Aljoscha Krettek commented on BEAM-241:
---------------------------------------
I think the short-term (and maybe also long-term) solution is to do everything
via {{DoFnRunners}}. Then we automatically get the correct behavior. What do
you think? Should I open an issue for the Flink runner?
I also want to unify how we treat {{DoFn}}s and windowing operations, or do you
think that we shouldn't use a {{GroupAlsoByWindowViaWindowSetDoFn}} once the
new runner API is in?
> Not easy for runners to get late-data dropping
> ----------------------------------------------
>
> Key: BEAM-241
> URL: https://issues.apache.org/jira/browse/BEAM-241
> Project: Beam
> Issue Type: Bug
> Components: runner-core
> Reporter: Mark Shields
> Assignee: Frances Perry
>
> Quite by accident I realized that the Flink runner delegates to
> GroupAlsoByWindowViaWindowSetDoFn for GBK, which in turn delegates to
> ReduceFnRunner. The latter now assumes that no messages will arrive beyond the
> 'garbage collection' time of their target window(s).
> The Dataflow runner interposes a LateDataDroppingDoFnRunner into the path so
> as to drop those too-late messages. That's done (I think) using
> DoFnRunners.createDefault.
> I don't think the Flink runner does that.
> We need a nice runner-friendly way of dealing with the too-late data.
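
To make the too-late check concrete, here is a minimal Java sketch. It assumes
a window is garbage collected once the input watermark passes the window's
maximum timestamp plus the allowed lateness. The class and method names
(LateDataSketch, isTooLate, dropTooLate) are hypothetical and are not Beam's
API; this only illustrates the kind of filter a runner would interpose before
ReduceFnRunner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Beam's API: all names here are illustrative.
class LateDataSketch {

    /** An element together with the maximum timestamp of its target window. */
    static final class WindowedElement {
        final String value;
        final long windowMaxTimestampMillis;

        WindowedElement(String value, long windowMaxTimestampMillis) {
            this.value = value;
            this.windowMaxTimestampMillis = windowMaxTimestampMillis;
        }
    }

    /**
     * Assumed rule: a window is expired once the input watermark is strictly
     * past (window max timestamp + allowed lateness).
     */
    static boolean isTooLate(long windowMaxTimestampMillis,
                             long allowedLatenessMillis,
                             long inputWatermarkMillis) {
        return windowMaxTimestampMillis + allowedLatenessMillis < inputWatermarkMillis;
    }

    /** Returns only the elements whose target window has not yet expired. */
    static List<WindowedElement> dropTooLate(List<WindowedElement> input,
                                             long allowedLatenessMillis,
                                             long inputWatermarkMillis) {
        List<WindowedElement> kept = new ArrayList<>();
        for (WindowedElement e : input) {
            if (!isTooLate(e.windowMaxTimestampMillis, allowedLatenessMillis, inputWatermarkMillis)) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<WindowedElement> input = new ArrayList<>();
        input.add(new WindowedElement("on-time", 2_000L));
        input.add(new WindowedElement("too-late", 500L));

        // Allowed lateness of 500 ms, input watermark at 1500 ms.
        for (WindowedElement e : dropTooLate(input, 500L, 1_500L)) {
            System.out.println(e.value);
        }
    }
}
```

Only elements that survive dropTooLate would be handed on to the grouping
logic, so that code never sees a message for an already collected window.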