Yes, it is understood that late data dropping is not reliable in a distributed runner. But this is exactly where DirectRunner can play its role, because only there is it actually possible to emulate and test specific watermark conditions. A related question for the Java DirectRunner: should we completely drop LateDataDroppingDoFnRunner and delegate late data dropping to StatefulDoFnRunner? That seems logical to me: if we agree that late data should always be dropped, then there would be no "valid" use of StatefulDoFnRunner without the late data dropping functionality.

On 1/3/20 9:32 PM, Robert Bradshaw wrote:
I agree. In fact, we just recently enabled late data dropping in the Python direct runner to be able to develop better tests for Dataflow.

It should be noted, however, that in a distributed runner (absent the quiescence of TestStream) one can't *count* on late data being dropped at a certain point. In fact, due to delays in fully propagating the watermark, late data can even become on-time, so the promises about what happens behind the watermark are necessarily a bit loose.
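
To make the point about watermark propagation concrete, here is a minimal sketch in plain Python. It is not Beam's actual implementation. It assumes the usual rule that an element is droppable once its window's maximum timestamp plus the allowed lateness is behind the input watermark; `Window`, `is_droppable`, and all millisecond values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical fixed window; times are integer milliseconds."""
    start: int
    end: int  # exclusive upper bound

    @property
    def max_timestamp(self) -> int:
        # Last timestamp that still belongs to the window.
        return self.end - 1


def is_droppable(window: Window, allowed_lateness: int, input_watermark: int) -> bool:
    """Assumed rule: the window has expired relative to the watermark."""
    return window.max_timestamp + allowed_lateness < input_watermark


window = Window(start=0, end=10_000)
allowed_lateness = 2_000

# A worker whose view of the watermark is up to date drops the element.
current_watermark = 15_000
# A worker that has not yet received the latest watermark still accepts it.
stale_local_watermark = 11_000

dropped_with_current = is_droppable(window, allowed_lateness, current_watermark)
dropped_with_stale = is_droppable(window, allowed_lateness, stale_local_watermark)
```

The same element is dropped by the worker that has seen the current watermark and accepted by the one that has not, which is why only a runner with full control over the watermark can test this behavior deterministically.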

On Fri, Jan 3, 2020 at 9:15 AM Luke Cwik <[email protected]> wrote:

    I agree that the DirectRunner should drop late data. Late data
    dropping is optional but the DirectRunner is used by many for
    testing and we should have the same behaviour they would get on
    other runners or users may be surprised.

    On Fri, Jan 3, 2020 at 3:33 AM Jan Lukavský <[email protected]> wrote:

        Hi,

        I just found out that DirectRunner is apparently not using
        LateDataDroppingDoFnRunner, which means that it doesn't drop late
        data in cases where there is no GBK operation involved (dropping
        in GBK seems to be correct). There is apparently no
        @Category(ValidatesRunner) test for that behavior (because
        DirectRunner would fail it), so the question is: should late data
        dropping be considered part of the model (of which DirectRunner
        should be a canonical implementation), and therefore be fixed
        there, or is late data dropping an optional feature of a runner?

        I'm strongly in favor of the first option, and I think it is
        likely that all real-world runners would probably adhere to that
        (I didn't check that, though).

        Opinions?

          Jan
