Apache Beam YAML makes heavy use of schemas, both to provide
high-level, semantically meaningful transforms and to make it less
painful to mix and match transforms across language boundaries.
This works well where we are able to infer the schemas, but requires
painful manual declarations where we are not (PubSub inputs being a
prime example). There are also some cases where we do not care about
the full structure of the input data (e.g. we are augmenting or
filtering based on a few fields) or do not care about it at all (e.g.
the downstream consumer can accept dynamically-schema'd data, like
BigQuery write).

These use cases are not handled well in the current system, but have
proven to be important for many Beam users (as attested to by, e.g.,
Dataflow templates usage). We would like to be able to easily and
naturally support such use cases in Beam YAML as well. Note that
unknown-schema'd data is different from (and possibly more flexible
than) fully unschema'd data (such as arbitrary Python or Java objects).

I've written up a doc exploring this and some possible solutions at
https://s.apache.org/beam-yaml-unknown-schema and would welcome any
feedback or ideas people have.

- Robert
