Hi Can,

Thanks for the update. Interesting question. Flink has an optimization built in called chaining which works together nicely with Beam. Essentially, operators which share the same partitioning get executed one after another inside a master operator. This saves resources.

Interestingly, Beam's Fuser for portable Runners does something similar. AFAIK there is no built-in solution for the old-style Runners. I think it would be possible to build something like this on top of the existing translation.

Cheers,
Max

On 20.03.19 13:07, Can Gencer wrote:
Hi again,

We've made some progress on the runner since writing this more than a month ago, the repo is available here publicly: https://github.com/hazelcast/hazelcast-jet-beam-runner

Still very much a work in progress though. One of the issues I wanted to raise is that currently we're translating each PTransform to a Jet Vertex (could be consider analogous to a Flink operator or a vertex in Tez). This is sub-optimal, since Beam creates lots of transforms for computations that could be performed inside the same Vertex, such as subsequent mapping transforms and many others. Ideally you only need distinct vertices where the data is re-partitioned and/or shuffled. I'm curious if Beam offers some way of translating the PTransform graph to a more minimal set of transforms, i.e. some kind of planner or would this have to be custom code? We've done a similar integration with Cascading in the past and it offered a planner which given a set of rules would partition the Cascading DAG into a minimal set of vertices for the same DAG. Curious if Beam has any similar functionality?



On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <[email protected] <mailto:[email protected]>> wrote:

    Elaborating on what Robert alluded to: when I wrote that runner
    author guide, portability was in its infancy. Now Beam Python can be
    run on Flink. So that guide is primarily focused on the "deserialize
    a Java DoFn and call its methods" approach. A decent amount of it is
    still really important to know, but is now the responsibility of the
    "SDK harness", aka language-specific coprocessor. For Python & Go &
    <insert new SDK language here> you really want to use the
    portability protos and the portable Flink runner is the best model.

    Kenn


    On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <[email protected]
    <mailto:[email protected]>> wrote:

        On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <[email protected]
        <mailto:[email protected]>> wrote:
         >
         > We at Hazelcast are looking into writing a Beam runner for
        Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
        wanted to introduce myself as we'll likely have questions as we
        start development.

        Welcome!

        Hazelcast looks interesting, a Beam runner for it would be very
        cool.

         > Some of the things I'm wondering about currently:
         >
         > * Currently there seems to be a guide available at
        https://beam.apache.org/contribute/runner-guide/ , is this up to
        date? Is there anything in specific to be aware of when starting
        with a new runner that's not covered here?

        That looks like a pretty good starting point. At a quick glance, I
        don't see anything that looks out of date. Another resource that
        might
        be helpful is a talk from last year on writing an SDK (but as it
        mostly covers the runner-sdk interaction, it's also quite useful for
        understanding the runner side:
        
https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
        And please feel free to ask any questions on this list as well; we'd
        be happy to help.

         > * Should we be targeting the latest master which is at
        2.12-SNAPSHOT or a stable version?

        I would target the latest master.

         > * After a runner is developed, how is the maintenance
        typically handled, as the runners seems to be part of Beam codebase?

        Either is possible. Several runner adapters are part of the Beam
        codebase, but for example the IMB Streams Beam runner is not. There
        are certainly pros and cons (certainly early on when the APIs
        themselves were under heavy development it was easier to keep things
        in sync in the same codebase, but things have mostly stabilized
        now).
        A runner only becomes part of the Beam codebase if there are members
        of the community committed to maintaining it (which could include
        you). Both approaches are fine.

        - Robert

Reply via email to