> and then have Flink manage the parallelism for stages downstream from that?

@Pablo Can you clarify what you mean by that?

Let me paraphrase this just to get a clear understanding. There are two approaches to test portable streaming pipelines:

a) Use an Impulse followed by a test PTransform which generates testing data. This is similar to how streaming sources that don't use the Read transform work. For basic testing this should work, even without support for Timers (a rough sketch follows below).

b) Introduce a new URN which gets translated to a native Flink/Spark/xy testing transform.

We should go for a), as this will make testing easier across portable runners. We previously discussed that native transforms will be an option in Beam, but it would be preferable to leave them out of testing for now.
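For illustration, here is a rough Python SDK sketch of a). The DoFn and its names are made up for the example; it simply fans the single Impulse element out into test records:

import apache_beam as beam

class GenerateTestData(beam.DoFn):
    """Fans the single empty-byte Impulse element out into test records."""
    def __init__(self, num_elements=1000):
        self.num_elements = num_elements

    def process(self, unused_element):
        for i in range(self.num_elements):
            yield {'key': i % 10, 'value': i}

with beam.Pipeline() as p:
    test_data = (
        p
        | 'Start' >> beam.Impulse()
        | 'Generate' >> beam.ParDo(GenerateTestData()))

Since everything after the Impulse is an ordinary SDK transform, the same pipeline should work unchanged across portable runners.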

Thanks,
Max


On 28.09.18 21:14, Thomas Weise wrote:
Thanks for sharing the link, this looks very promising!

For the synthetic source, if we need a runner-native trigger mechanism, then it should probably just emit an empty byte array, like the Impulse implementation does, and everything else could be left to SDK-specific transforms downstream. We don't have support for timers in the portable Flink runner yet. With timers, there would be no need for a runner-native URN, and it could work just as Pablo described.


On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy <[email protected]> wrote:

    Hi all,

    thank you, Thomas, for starting this discussion and Pablo for
    sharing the ideas. FWIW, we discussed this in the context of the
    Core Beam Transform Load Tests that we are working on right now [1].
    If generating synthetic data becomes possible for portable streaming
    pipelines, we could use it in our work to test Python streaming
    scenarios.

    [1] https://s.apache.org/GVMa

    On Fri, Sep 28, 2018 at 08:18 Pablo Estrada <[email protected]> wrote:

        Hi Thomas, all,
        yes, this is quite important for testing, and in fact I'd think
        it's important to streamline the insertion of native sources
        from different runners to make the current runners more usable.
        But that's another topic.

        For generators of synthetic data, I had a couple ideas (and this
        will show my limited knowledge about Flink and Streaming, but oh
        well):

        - Flink experts: Is it possible to add a pure-Beam generator
        that would do something like: Impulse -> ParDo(generate multiple
        elements) -> forced "write" to Flink (e.g. something like a
        reshuffle), and then have Flink manage the parallelism for
        stages downstream from that? (A rough sketch follows below.)

        - If this is not possible, it may be worth writing a
        transform in Flink / other runners that can be plugged in by
        inserting a custom URN. In fact, it may be a good idea to
        streamline the insertion of native sources for each runner based
        on some sort of CustomURNTransform()? (Also sketched below.)
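        To make the first idea concrete, a rough sketch; Reshuffle is
        used here as the "forced write" step, which should break fusion
        so the downstream stages can run at the runner's parallelism:

        import apache_beam as beam

        with beam.Pipeline() as p:
            _ = (
                p
                | beam.Impulse()
                | 'Generate' >> beam.FlatMap(lambda _: range(1000))
                | beam.Reshuffle()  # redistribute before downstream stages
                | 'Process' >> beam.Map(lambda x: x * 2))

        And a placeholder for the second idea. The URN and class names
        are invented for the example; a runner whose translation layer
        recognized the URN would substitute its native generator:

        SYNTHETIC_SOURCE_URN = 'beam:transform:synthetic_streaming_source:v1'

        class SyntheticStreamingSource(beam.PTransform):
            """Placeholder node to be replaced by a runner-native
            generator; the expand() body is never meant to execute."""
            def __init__(self, interval_sec=1.0, max_count=100):
                self.interval_sec = interval_sec
                self.max_count = max_count

            def expand(self, pbegin):
                raise NotImplementedError(
                    'Expected to be substituted during runner translation.')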

        I hope I did not butcher those explanations too badly...
        Best
        -P.

        On Thu, Sep 27, 2018, 5:55 PM Thomas Weise <[email protected]> wrote:

            There have been a few discussions about how we can
            facilitate testing for portable streaming pipelines with the
            Flink runner. The problem is that we currently don't have
            streaming sources in the Python SDK.

            One way to support testing could be a generator that extends
            the idea of Impulse to provide a Flink-native trigger
            transform, optionally parameterized with an interval and a
            max count.

            Test pipelines could then follow the generator with a Map
            function that creates whatever payloads are desirable.
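
            A rough sketch of how such a test pipeline could look; the
            generator below is only a bounded stand-in built on
            beam.Create, since the proposed transform does not exist yet:

            import apache_beam as beam

            class StreamingGenerator(beam.PTransform):
                """Bounded stand-in for the proposed runner-native
                trigger transform; the real one would fire every
                'interval_sec' seconds up to 'max_count' times."""
                def __init__(self, interval_sec, max_count):
                    self.interval_sec = interval_sec
                    self.max_count = max_count

                def expand(self, pbegin):
                    return pbegin | beam.Create(range(self.max_count))

            with beam.Pipeline() as p:
                _ = (
                    p
                    | StreamingGenerator(interval_sec=0.5, max_count=1000)
                    | 'Payload' >> beam.Map(
                        lambda i: {'id': i, 'payload': b'test'}))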

            Thoughts?

            Thanks,
            Thomas
