This seems like a worthwhile addition that can expand the community by
making Beam more accessible to more users and more use cases.

A bit of a tangent, since this comments on @Byron Ellis <byronel...@google.com>'s
part, but ... making sure others have also seen Dataform [ ex:
https://cloud.google.com/dataform/docs/overview ... and, formerly,
https://dataform.co/ ]. Since it is now part of the same company as you,
there are potentially additional, maybe-straightforward
conversations/lessons-learned/etc. to discuss [ in addition to
collaborations with the dbt community ]. At times, I think of these two
[ dbt, Dataform ] as addressing similar things.



On Thu, Dec 15, 2022 at 4:17 PM Ahmet Altay via dev <dev@beam.apache.org>
wrote:

> +1 to both of these proposals. In the past 12 months I have heard of at
> least 3 YAML implementations built on top of Beam in large production
> systems. Unfortunately, none of those were open sourced. Having these out
> of the box would be great, and it will clearly have user demand. Thank
> you all!
>
> On Thu, Dec 15, 2022 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum
>> <sjvanros...@google.com> wrote:
>> >
>> > This is great! I developed a similar template a year or two ago as a
>> reference for a customer to speed up their development process, and
>> unsurprisingly it did.
>> > Here's an example of the config layout I came up with at the time:
>> >
>> > options:
>> >   runner: DirectRunner
>> >
>> > pipeline:
>> > # - &messages
>> > #   label: PubSub XML source
>> > #   transform:
>> > #     !PTransform:apache_beam.io.ReadFromPubSub
>> > #     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
>> > - &message_source_1
>> >   label: XML source 1
>> >   transform:
>> >     !PTransform:apache_beam.Create
>> >     values:
>> >     - /path/to/file.xml
>> > - &message_source_2
>> >   label: XML source 2
>> >   transform:
>> >     !PTransform:apache_beam.Create
>> >     values:
>> >     - /path/to/another/file.xml
>> > - &message_xml
>> >   label: XMLs
>> >   inputs:
>> >   - step: *message_source_1
>> >   - step: *message_source_2
>> >   transform:
>> >     !PTransform:utils.transforms.ParseXmlDocument {}
>> > - &validated_messages
>> >   label: Validate XMLs
>> >   inputs:
>> >   - step: *message_xml
>> >     tag: success
>> >   transform:
>> >     !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
>> >     schema: /path/to/file.xsd
>> > - &converted_messages
>> >   label: Convert XMLs
>> >   inputs:
>> >   - step: *validated_messages
>> >   transform:
>> >     !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
>> >     schema: /path/to/file.xsd
>> > - label: Print XMLs
>> >   inputs:
>> >   - step: *converted_messages
>> >   transform:
>> >     !PTransform:utils.transforms.Print {}
>> >
>> > Highlights:
>> > Pipeline options are supplied under an options property.
>>
>> Yep, I was thinking exactly the same:
>>
>> https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51
>>
>> > A pipeline is a flat set of all transforms in the pipeline.
>>
>> One can certainly enumerate the transforms as a flat set, but I do
>> think being able to define a composite structure is nice. In addition,
>> the "chain" composite allows one to automatically infer the
>> input-output relation rather than having to spell it out (much as one
>> can chain multiple transforms in the various SDKs rather than having to
>> assign each result to an intermediate).
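>>
>> For example, a rough sketch in the syntax of the proposal below (the
>> explicit variant assumes a hypothetical "input" field naming the
>> upstream transform, which the proposal doesn't spell out):
>>
>> pipeline:
>>   - type: chain
>>     transforms:
>>       - type: ReadFromText
>>         args:
>>           file_pattern: "input.txt"
>>       - type: PyMap        # input inferred: the ReadFromText above
>>         fn: "str.lower"
>>
>> versus:
>>
>> pipeline:
>>   - type: ReadFromText
>>     name: Read
>>     args:
>>       file_pattern: "input.txt"
>>   - type: PyMap
>>     name: Lower
>>     input: Read            # input-output relation spelled out
>>     fn: "str.lower"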
>>
>> > Transforms are defined using a YAML tag and named properties and can be
>> used by constructing a YAML reference.
>>
>> That's an interesting idea. Can it be done inline as well?
>>
>> > DAG construction is done using a simple topological sort of transforms
>> and their dependencies.
>>
>> Same.
>>
>> > Named side outputs can be referenced using a tag field.
>>
>> I didn't put this in any of the examples, but I do the same. If a
>> transform Foo produces multiple outputs, one can (in fact must)
>> reference the various outputs by Foo.output1, Foo.output2, etc.
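>>
>> Roughly like this (ValidateRecords and its "good"/"bad" output names
>> are invented here just to illustrate the referencing):
>>
>> pipeline:
>>   - type: ReadFromText
>>     name: Read
>>     args:
>>       file_pattern: "records.txt"
>>   - type: ValidateRecords   # hypothetical multi-output transform
>>     name: Validate
>>     input: Read
>>   - type: WriteToText
>>     input: Validate.good
>>     file_path_prefix: "good"
>>   - type: WriteToText
>>     input: Validate.bad
>>     file_path_prefix: "bad"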
>>
>> > Multiple inputs are merged with a Flatten transform.
>>
>> PTransforms can have named inputs as well (they're not always
>> symmetric), so I let inputs be a map if they care to distinguish them.
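>>
>> Something like the following sketch (JoinWithLookup and its input
>> names are made up for illustration):
>>
>> pipeline:
>>   - type: ReadFromText
>>     name: Events
>>     args:
>>       file_pattern: "events.txt"
>>   - type: ReadFromText
>>     name: Lookup
>>     args:
>>       file_pattern: "lookup.txt"
>>   - type: JoinWithLookup    # hypothetical transform with named inputs
>>     input:
>>       main: Events
>>       side: Lookup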
>>
>> > Not sure if there's any inspiration left to take from this, but I
>> figured I'd throw it up here to share.
>>
>> Thanks. It's neat to see others coming up with the same idea, with
>> very similar conventions; it validates that this would be both natural
>> and useful.
>>
>>
>> > On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>> >>
>> >> +1 for these proposals and agree that these will simplify and
>> demystify Beam for many new users. I think when combined with the
>> x-lang/Schema-Aware transform binding, these might end up being adequate
>> solutions for many production use-cases as well (unless users need to
>> define custom composites, I/O connectors, etc.).
>> >>
>> >> Also, thanks for providing prototype implementations with examples.
>> >>
>> >> - Cham
>> >>
>> >>
>> >> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <
>> dev@beam.apache.org> wrote:
>> >>>
>> >>> To build on Kenn's point, if we leverage existing stuff like dbt we
>> get access to a ready-made community which can help drive both adoption
>> and incremental innovation by bringing more folks to Beam.
>> >>>
>> >>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org>
>> wrote:
>> >>>>
>> >>>> 1. I love the idea. Back in the early days people talked about an
>> "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at
>> the time. Portability and specifically cross-language schema transforms
>> give the right infrastructure, so this is the perfect time: unique names
>> (URNs) for transforms and explicit lists of parameters they require.
>> >>>>
>> >>>> 2. I like the idea of re-using some existing thing like dbt if it is
>> pretty much what we were going to do anyhow. I don't think we should hold
>> ourselves back. I also don't think we'll gain anything in terms of
>> implementation. But at least it could fast-forward our design process
>> because most of the decisions have simply already been made for us.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <
>> dev@beam.apache.org> wrote:
>> >>>>>
>> >>>>> And I guess also a PR for completeness to make it easier to find
>> going forward instead of my random repo:
>> https://github.com/apache/beam/pull/24670
>> >>>>>
>> >>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com>
>> wrote:
>> >>>>>>
>> >>>>>> Since Robert opened that can of worms (and we happened to talk
>> about it yesterday)... :-)
>> >>>>>>
>> >>>>>> I figured I'd also share my start on a "port" of dbt to the Beam
>> SDK. This would be complementary, as it doesn't really provide a way of
>> specifying a pipeline so much as a way of orchestrating and packaging a
>> complex pipeline. dbt itself supports SQL and Python DataFrames, which
>> both seem like reasonable things for Beam, and it wouldn't be a stretch
>> to include something like the format above. Though in my head I had
>> imagined people would tend to write composite transforms in the SDK of
>> their choosing that are then exposed at this layer. I decided to go with
>> dbt as it also provides a number of nice "quality of life" features for
>> its users, like documentation, validation, environments and so on.
>> >>>>>>
>> >>>>>> I did a really quick proof-of-viability implementation here:
>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>> >>>>>>
>> >>>>>> And you can see a really simple pipeline that reads a seed file
>> (TextIO), runs it through a couple of SQLTransforms and then drops it out
>> to a logger via a simple DoFn here:
>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>> >>>>>>
>> >>>>>> I've also heard a rumor that there might be a textproto-based
>> representation floating around too :-)
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> B
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>> dev@beam.apache.org> wrote:
>> >>>>>>>
>> >>>>>>> Hello Robert,
>> >>>>>>>
>> >>>>>>> I'm replying to say that I've been waiting for something like
>> this ever since I started learning Beam, and I'm grateful you are pushing
>> this forward.
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> Damon
>> >>>>>>>
>> >>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >>>>>>>>
>> >>>>>>>> While Beam provides powerful APIs for authoring sophisticated
>> data
>> >>>>>>>> processing pipelines, it often still has too high a barrier for
>> >>>>>>>> getting started and authoring simple pipelines. Even setting up
>> the
>> >>>>>>>> environment, installing the dependencies, and setting up the
>> project
>> >>>>>>>> can be an overwhelming amount of boilerplate for some (though
>> >>>>>>>> https://beam.apache.org/blog/beam-starter-projects/ has gone a
>> long
>> >>>>>>>> way in making this easier). At the other extreme, the Dataflow
>> project
>> >>>>>>>> has the notion of templates which are pre-built Beam pipelines
>> that
>> >>>>>>>> can be easily launched from the command line, or even from your
>> >>>>>>>> browser, but they are fairly restrictive, limited to
>> pre-assembled
>> >>>>>>>> pipelines taking a small number of parameters.
>> >>>>>>>>
>> >>>>>>>> The idea of creating a yaml-based description of pipelines has
>> come up
>> >>>>>>>> several times in several contexts, and this last week I decided
>> to code
>> >>>>>>>> up what it could look like. Here's a proposal.
>> >>>>>>>>
>> >>>>>>>> pipeline:
>> >>>>>>>>   - type: chain
>> >>>>>>>>     transforms:
>> >>>>>>>>       - type: ReadFromText
>> >>>>>>>>         args:
>> >>>>>>>>          file_pattern: "wordcount.yaml"
>> >>>>>>>>       - type: PyMap
>> >>>>>>>>         fn: "str.lower"
>> >>>>>>>>       - type: PyFlatMap
>> >>>>>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>> >>>>>>>>       - type: PyTransform
>> >>>>>>>>         name: Count
>> >>>>>>>>         constructor:
>> "apache_beam.transforms.combiners.Count.PerElement"
>> >>>>>>>>       - type: PyMap
>> >>>>>>>>         fn: str
>> >>>>>>>>       - type: WriteToText
>> >>>>>>>>         file_path_prefix: "counts.txt"
>> >>>>>>>>
>> >>>>>>>> Some more examples at
>> >>>>>>>>
>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>> >>>>>>>>
>> >>>>>>>> A prototype (feedback welcome) can be found at
>> >>>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>> >>>>>>>>
>> >>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec_file
>> >>>>>>>> [path/to/file.yaml] [other_pipeline_args]
>> >>>>>>>>
>> >>>>>>>> or
>> >>>>>>>>
>> >>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec
>> [yaml_contents]
>> >>>>>>>> [other_pipeline_args]
>> >>>>>>>>
>> >>>>>>>> For example, to play around with this one could do
>> >>>>>>>>
>> >>>>>>>>     python -m apache_beam.yaml.main  \
>> >>>>>>>>         --pipeline_spec "$(curl
>> >>>>>>>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> >>>>>>>> \
>> >>>>>>>>         --runner=apache_beam.runners.render.RenderRunner \
>> >>>>>>>>         --render_out=out.png
>> >>>>>>>>
>> >>>>>>>> Alternatively one can run it as a docker container with no need
>> to
>> >>>>>>>> install any SDK
>> >>>>>>>>
>> >>>>>>>>     docker run --rm \
>> >>>>>>>>         --entrypoint /usr/local/bin/python \
>> >>>>>>>>         gcr.io/apache-beam-testing/yaml_template:dev
>> >>>>>>>> /dataflow/template/main.py \
>> >>>>>>>>         --pipeline_spec="$(curl
>> >>>>>>>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> >>>>>>>>
>> >>>>>>>> Though of course one would have to set up the appropriate mount
>> points
>> >>>>>>>> to do any local filesystem I/O and/or to pass in credentials.
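>> >>>>>>>>
>> >>>>>>>> For instance, something along these lines would mount the working
>> >>>>>>>> directory and gcloud credentials (the mount targets here are
>> >>>>>>>> assumptions; adjust for the container's user):
>> >>>>>>>>
>> >>>>>>>>     docker run --rm \
>> >>>>>>>>         -v "$PWD":/workspace \
>> >>>>>>>>         -v "$HOME/.config/gcloud":/root/.config/gcloud \
>> >>>>>>>>         --entrypoint /usr/local/bin/python \
>> >>>>>>>>         gcr.io/apache-beam-testing/yaml_template:dev \
>> >>>>>>>>         /dataflow/template/main.py \
>> >>>>>>>>         --pipeline_spec="$(cat pipeline.yaml)"
>> >>>>>>>>
>> >>>>>>>> with any file paths inside the spec given relative to /workspace.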
>> >>>>>>>>
>> >>>>>>>> This is also available as a Dataflow template and can be invoked
>> as
>> >>>>>>>>
>> >>>>>>>>     gcloud dataflow flex-template run \
>> >>>>>>>>         "yaml-template-job" \
>> >>>>>>>>          --template-file-gcs-location
>> >>>>>>>> gs://apache-beam-testing-robertwb/yaml_template.json \
>> >>>>>>>>         --parameters ^~^pipeline_spec="$(curl
>> >>>>>>>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> >>>>>>>> \
>> >>>>>>>>         --parameters pickle_library=cloudpickle \
>> >>>>>>>>         --project=apache-beam-testing \
>> >>>>>>>>         --region us-central1
>> >>>>>>>>
>> >>>>>>>> (Note the escaping required for the parameter (use cat for a
>> local
>> >>>>>>>> file), and the debug cycle here could be greatly improved, so I'd
>> >>>>>>>> recommend trying things locally first.)
>> >>>>>>>>
>> >>>>>>>> A key point of this implementation is that it heavily uses the
>> >>>>>>>> expansion service and cross language transforms, tying into the
>> >>>>>>>> proposal at  https://s.apache.org/easy-multi-language . Though
>> all the
>> >>>>>>>> examples use transforms defined in the Beam SDK, any
>> appropriately
>> >>>>>>>> packaged libraries may be used.
>> >>>>>>>>
>> >>>>>>>> There are many ways this could be extended. For example:
>> >>>>>>>>
>> >>>>>>>> * It would be useful to be able to templatize yaml descriptions.
>> This
>> >>>>>>>> could be done with $SIGIL type notation or some other way. This
>> would
>> >>>>>>>> even allow one to define reusable, parameterized composite
>> PTransform
>> >>>>>>>> types in yaml itself.
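>> >>>>>>>>
>> >>>>>>>> A purely hypothetical sketch of what such a definition might look
>> >>>>>>>> like (none of this syntax exists in the prototype):
>> >>>>>>>>
>> >>>>>>>> transform_types:
>> >>>>>>>>   - name: ReadAndLower
>> >>>>>>>>     parameters: [FILE_PATTERN]
>> >>>>>>>>     body:
>> >>>>>>>>       type: chain
>> >>>>>>>>       transforms:
>> >>>>>>>>         - type: ReadFromText
>> >>>>>>>>           args:
>> >>>>>>>>             file_pattern: $FILE_PATTERN
>> >>>>>>>>         - type: PyMap
>> >>>>>>>>           fn: "str.lower"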
>> >>>>>>>>
>> >>>>>>>> * It would be good to have a more principled way of merging
>> >>>>>>>> environments. Currently each set of dependencies is a unique Beam
>> >>>>>>>> environment, and while Beam has sophisticated cross-language
>> >>>>>>>> capabilities, it would be nice if environments sharing the same
>> >>>>>>>> language (and likely also the same Beam version) could be fused
>> >>>>>>>> in-process (e.g. with separate class loaders or compatibility
>> checks
>> >>>>>>>> for packages).
>> >>>>>>>>
>> >>>>>>>> * Publishing and discovery of transformations could be improved,
>> >>>>>>>> possibly via shared standards and some kind of a transform
>> catalog. An
>> >>>>>>>> ecosystem of easily sharable transforms (similar to what
>> huggingface
>> >>>>>>>> provides for ML models) could provide a useful platform for
>> making it
>> >>>>>>>> easy to build pipelines and open up Beam to a whole new set of
>> users.
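>> >>>>>>>>
>> >>>>>>>> To make that concrete: a catalog entry might record little more
>> >>>>>>>> than a URN, a package to fetch, and the expected parameters. The
>> >>>>>>>> schema below is invented purely for illustration:
>> >>>>>>>>
>> >>>>>>>> transform:
>> >>>>>>>>   urn: "org:example:anonymize:v1"
>> >>>>>>>>   provider:
>> >>>>>>>>     language: python
>> >>>>>>>>     package: example-anonymize==1.0
>> >>>>>>>>   parameters:
>> >>>>>>>>     - name: fields
>> >>>>>>>>       type: "list[string]"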
>> >>>>>>>>
>> >>>>>>>> Let me know what you think.
>> >>>>>>>>
>> >>>>>>>> - Robert
>>
>