Re: Dropping late data in DirectRunner

Jan Lukavský Fri, 03 Jan 2020 13:03:31 -0800

Yes, the non-reliability of late data dropping in distributed runner isunderstood. But this is even where DirectRunner can play its role,because only there it is actually possible to emulate and test specificwatermark conditions. Question regarding this for the java DirectRunner- should we completely drop LataDataDroppingDoFnRunner and delegate thelate data dropping to StatefulDoFnRunner? Seems logical to me, as if weagree that late data should always be dropped, then there would no"valid" use of StatefulDoFnRunner without the late data droppingfunctionality.


On 1/3/20 9:32 PM, Robert Bradshaw wrote:

I agree, in fact we just recently enabled late data dropping to thedirect runner in Python to be able to develop better tests for Dataflow.
It should be noted, however, that in a distributed runner (absent thequiessence of TestStream) that one can't *count* on late data beingdropped at a certain point, and in fact (due to delays in fullypropagating the watermark) late data can even become on-time, so thepromises about what happens behind the watermark are necessarily a bitloose.
On Fri, Jan 3, 2020 at 9:15 AM Luke Cwik <[email protected]<mailto:[email protected]>> wrote:
    I agree that the DirectRunner should drop late data. Late data
    dropping is optional but the DirectRunner is used by many for
    testing and we should have the same behaviour they would get on
    other runners or users may be surprised.

    On Fri, Jan 3, 2020 at 3:33 AM Jan Lukavský <[email protected]
    <mailto:[email protected]>> wrote:

        Hi,

        I just found out that DirectRunner is apparently not using
        LateDataDroppingDoFnRunner, which means that it doesn't drop
        late data
        in cases where there is no GBK operation involved (dropping in
        GBK seems
        to be correct). There is apparently no
        @Category(ValidatesRunner) test
        for that behavior (because DirectRunner would fail it), so the
        question
        is - should late data dropping be considered part of model (of
        which
        DirectRunner should be a canonical implementation) and
        therefore that
        should be fixed there, or is the late data dropping an
        optional feature
        of a runner?

        I'm strongly in favor of the first option, and I think it is
        likely that
        all real-world runners would probably adhere to that (I didn't
        check
        that, though).

        Opinions?

          Jan

Re: Dropping late data in DirectRunner

Reply via email to