On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum
<sjvanros...@google.com> wrote:
>
> This is great! I developed a similar template a year or two ago as a
> reference for a customer to speed up their development process, and
> unsurprisingly it did.
> Here's an example of the config layout I came up with at the time:
>
> options:
>   runner: DirectRunner
>
> pipeline:
> # - &messages
> #   label: PubSub XML source
> #   transform:
> #     !PTransform:apache_beam.io.ReadFromPubSub
> #     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
> - &message_source_1
>   label: XML source 1
>   transform:
>     !PTransform:apache_beam.Create
>     values:
>     - /path/to/file.xml
> - &message_source_2
>   label: XML source 2
>   transform:
>     !PTransform:apache_beam.Create
>     values:
>     - /path/to/another/file.xml
> - &message_xml
>   label: XMLs
>   inputs:
>   - step: *message_source_1
>   - step: *message_source_2
>   transform:
>     !PTransform:utils.transforms.ParseXmlDocument {}
> - &validated_messages
>   label: Validate XMLs
>   inputs:
>   - step: *message_xml
>     tag: success
>   transform:
>     !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
>     schema: /path/to/file.xsd
> - &converted_messages
>   label: Convert XMLs
>   inputs:
>   - step: *validated_messages
>   transform:
>     !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
>     schema: /path/to/file.xsd
> - label: Print XMLs
>   inputs:
>   - step: *converted_messages
>   transform:
>     !PTransform:utils.transforms.Print {}
>
> Highlights:
> Pipeline options are supplied under an options property.

Yep, I was thinking exactly the same:
https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51

> A pipeline is a flat set of all transforms in the pipeline.

One can certainly enumerate the transforms as a flat set, but I do
think being able to define a composite structure is nice. In addition,
the "chain" composite allows one to automatically infer the
input-output relation rather than having to spell it out (much as one
can chain multiple transforms in the various SDKs rather than
assigning each result to an intermediate).

> Transforms are defined using a YAML tag and named properties and can be used 
> by constructing a YAML reference.

That's an interesting idea. Can it be done inline as well?

> DAG construction is done using a simple topological sort of transforms and 
> their dependencies.

Same.

> Named side outputs can be referenced using a tag field.

I didn't put this in any of the examples, but I do the same. If a
transform Foo produces multiple outputs, one can (in fact must)
reference the various outputs by Foo.output1, Foo.output2, etc.
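
Sketched in the same style (the transform and field names here are
hypothetical):

pipeline:
  - type: SomeMultiOutputTransform  # suppose this produces output1, output2
    name: Foo
  - type: PyMap
    name: HandleFirst
    input: Foo.output1  # a specific output must be named
    fn: "str"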

> Multiple inputs are merged with a Flatten transform.

PTransforms can have named inputs as well (they're not always
symmetric), so I let inputs be a map for those who care to distinguish them.
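
For example (again with hypothetical names), a two-input transform
could distinguish its inputs like so:

pipeline:
  ...
  - type: SomeJoinTransform  # a transform whose inputs aren't symmetric
    name: Joined
    input:
      left: TransformA
      right: TransformB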

> Not sure if there's any inspiration left to take from this, but I figured I'd 
> throw it up here to share.

Thanks. It's neat to see others coming up with the same idea with
very similar conventions; it validates that this would be both natural
and useful.


> On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev 
> <dev@beam.apache.org> wrote:
>>
>> +1 for these proposals; I agree that these will simplify and demystify Beam
>> for many new users. I think when combined with the x-lang/Schema-Aware
>> transform binding, these might end up being adequate solutions for many
>> production use-cases as well (unless users need to define custom composites,
>> I/O connectors, etc.).
>>
>> Also, thanks for providing prototype implementations with examples.
>>
>> - Cham
>>
>>
>> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <dev@beam.apache.org> 
>> wrote:
>>>
>>> To build on Kenn's point, if we leverage existing stuff like dbt, we get
>>> access to a ready-made community which can help drive both adoption and
>>> incremental innovation by bringing more folks to Beam.
>>>
>>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>> 1. I love the idea. Back in the early days people talked about an "XML 
>>>> SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the 
>>>> time. Portability and specifically cross-language schema transforms gives 
>>>> the right infrastructure so this is the perfect time: unique names (URNs) 
>>>> for transforms and explicit lists of parameters they require.
>>>>
>>>> 2. I like the idea of re-using some existing thing like dbt if it is
>>>> pretty much what we were going to do anyhow. I don't think we should hold
>>>> ourselves back, and I don't think we'll gain anything in terms of
>>>> implementation. But at least it could fast-forward our design process,
>>>> because most of the decisions would simply already be made for us.
>>>>
>>>>
>>>>
>>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> 
>>>> wrote:
>>>>>
>>>>> And, for completeness, a PR as well, so it's easier to find going
>>>>> forward than my random repo:
>>>>> https://github.com/apache/beam/pull/24670
>>>>>
>>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:
>>>>>>
>>>>>> Since Robert opened that can of worms (and we happened to talk about it 
>>>>>> yesterday)... :-)
>>>>>>
>>>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>>>>>> This would be complementary, as it doesn't really provide a way of
>>>>>> specifying a pipeline so much as a way of orchestrating and packaging a
>>>>>> complex one. dbt itself supports SQL and Python Dataframes, which both
>>>>>> seem like reasonable things for Beam, and it wouldn't be a stretch to
>>>>>> include something like the format above. Though in my head I had
>>>>>> imagined people would tend to write composite transforms in the SDK of
>>>>>> their choosing that are then exposed at this layer. I decided to go with
>>>>>> dbt as it also provides a number of nice "quality of life" features for
>>>>>> its users, like documentation, validation, environments, and so on.
>>>>>>
>>>>>> I did a really quick proof-of-viability implementation here: 
>>>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>>>>
>>>>>> And you can see a really simple pipeline that reads a seed file 
>>>>>> (TextIO), runs it through a couple of SQLTransforms and then drops it 
>>>>>> out to a logger via a simple DoFn here: 
>>>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>>>>>
>>>>>> I've also heard a rumor there might be a textproto-based
>>>>>> representation floating around too :-)
>>>>>>
>>>>>> Best,
>>>>>> B
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev 
>>>>>> <dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>> Hello Robert,
>>>>>>>
>>>>>>> I'm replying to say that I've been waiting for something like this ever 
>>>>>>> since I started learning Beam and I'm grateful you are pushing this 
>>>>>>> forward.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Damon
>>>>>>>
>>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> While Beam provides powerful APIs for authoring sophisticated data
>>>>>>>> processing pipelines, it often still has too high a barrier for
>>>>>>>> getting started and authoring simple pipelines. Even setting up the
>>>>>>>> environment, installing the dependencies, and setting up the project
>>>>>>>> can be an overwhelming amount of boilerplate for some (though
>>>>>>>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>>>>>>>> way in making this easier). At the other extreme, the Dataflow project
>>>>>>>> has the notion of templates, which are pre-built Beam pipelines that
>>>>>>>> can be easily launched from the command line, or even from your
>>>>>>>> browser, but they are fairly restrictive, limited to pre-assembled
>>>>>>>> pipelines taking a small number of parameters.
>>>>>>>>
>>>>>>>> The idea of creating a yaml-based description of pipelines has come up
>>>>>>>> several times in several contexts and this last week I decided to code
>>>>>>>> up what it could look like. Here's a proposal.
>>>>>>>>
>>>>>>>> pipeline:
>>>>>>>>   - type: chain
>>>>>>>>     transforms:
>>>>>>>>       - type: ReadFromText
>>>>>>>>         args:
>>>>>>>>          file_pattern: "wordcount.yaml"
>>>>>>>>       - type: PyMap
>>>>>>>>         fn: "str.lower"
>>>>>>>>       - type: PyFlatMap
>>>>>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>>>>>       - type: PyTransform
>>>>>>>>         name: Count
>>>>>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>>>       - type: PyMap
>>>>>>>>         fn: str
>>>>>>>>       - type: WriteToText
>>>>>>>>         file_path_prefix: "counts.txt"
>>>>>>>>
>>>>>>>> Some more examples at
>>>>>>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>>>>>
>>>>>>>> A prototype (feedback welcome) can be found at
>>>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>>>>>
>>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>>>>>
>>>>>>>> or
>>>>>>>>
>>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>>>>>
>>>>>>>> For example, to play around with this one could do
>>>>>>>>
>>>>>>>>     python -m apache_beam.yaml.main \
>>>>>>>>         --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>>         --runner=apache_beam.runners.render.RenderRunner \
>>>>>>>>         --render_out=out.png
>>>>>>>>
>>>>>>>> Alternatively one can run it as a docker container with no need to
>>>>>>>> install any SDK
>>>>>>>>
>>>>>>>>     docker run --rm \
>>>>>>>>         --entrypoint /usr/local/bin/python \
>>>>>>>>         gcr.io/apache-beam-testing/yaml_template:dev /dataflow/template/main.py \
>>>>>>>>         --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>>>>>
>>>>>>>> Though of course one would have to set up the appropriate mount points
>>>>>>>> to do any local filesystem io and/or credentials.
>>>>>>>>
>>>>>>>> This is also available as a Dataflow template and can be invoked as
>>>>>>>>
>>>>>>>>     gcloud dataflow flex-template run \
>>>>>>>>         "yaml-template-job" \
>>>>>>>>         --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>>>>>         --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>>         --parameters pickle_library=cloudpickle \
>>>>>>>>         --project=apache-beam-testing \
>>>>>>>>         --region us-central1
>>>>>>>>
>>>>>>>> (Note the escaping required for the parameter; use cat for a local
>>>>>>>> file. The debug cycle here could be greatly improved, so I'd
>>>>>>>> recommend trying things locally first.)
>>>>>>>>
>>>>>>>> A key point of this implementation is that it heavily uses the
>>>>>>>> expansion service and cross language transforms, tying into the
>>>>>>>> proposal at https://s.apache.org/easy-multi-language . Though all the
>>>>>>>> examples use transforms defined in the Beam SDK, any appropriately
>>>>>>>> packaged libraries may be used.
>>>>>>>>
>>>>>>>> There are many ways this could be extended. For example
>>>>>>>>
>>>>>>>> * It would be useful to be able to templatize yaml descriptions. This
>>>>>>>> could be done with $SIGIL type notation or some other way. This would
>>>>>>>> even allow one to define reusable, parameterized composite PTransform
>>>>>>>> types in yaml itself.
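>>>>>>>>
>>>>>>>> For instance (purely illustrative syntax, nothing like this exists
>>>>>>>> in the prototype yet), a reusable composite might be declared as
>>>>>>>>
>>>>>>>> transforms:
>>>>>>>>   ReadAndLower:
>>>>>>>>     params: [file_pattern]
>>>>>>>>     type: chain
>>>>>>>>     transforms:
>>>>>>>>       - type: ReadFromText
>>>>>>>>         args:
>>>>>>>>           file_pattern: $file_pattern
>>>>>>>>       - type: PyMap
>>>>>>>>         fn: "str.lower"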
>>>>>>>>
>>>>>>>> * It would be good to have a more principled way of merging
>>>>>>>> environments. Currently each set of dependencies is a unique Beam
>>>>>>>> environment, and while Beam has sophisticated cross-language
>>>>>>>> capabilities, it would be nice if environments sharing the same
>>>>>>>> language (and likely also the same Beam version) could be fused
>>>>>>>> in-process (e.g. with separate class loaders or compatibility checks
>>>>>>>> for packages).
>>>>>>>>
>>>>>>>> * Publishing and discovery of transformations could be improved,
>>>>>>>> possibly via shared standards and some kind of a transform catalog. An
>>>>>>>> ecosystem of easily sharable transforms (similar to what huggingface
>>>>>>>> provides for ML models) could provide a useful platform for making it
>>>>>>>> easy to build pipelines and open up Beam to a whole new set of users.
>>>>>>>>
>>>>>>>> Let me know what you think.
>>>>>>>>
>>>>>>>> - Robert
