Here's the raw doc content, for your convenience.

Beam I/O representation

In the current Beam programming model, sources and sinks are virtually indistinguishable from other transforms. From a composition point of view, this is great. But from an integration point of view, special handling of external data sources would be very useful. I don't intend to propose any particular solution, but I will outline two use cases that would be great to support in Beam. I will also note that generalizing these ideas to cover all external dependencies and not just data sources seems like a good idea.

1. Externally configurable sources.
   a. Each I/O type should be describable using a configuration proto specific to that I/O type. This would make configuring sources work the same way across SDK languages. It would also make it obvious what parameterization is supported and how to use it, for example selecting a subset of BigQuery partitions.
   b. Tools that help a user construct a pipeline will likely already have some representation of the user's available data sources, something like a Data Lake or Data Hub. The easiest initial integration with these tools would be canned pipelines that can be configured with any type of data source. To support this, we would need pipelines that can be configured at runtime with both the type of source and its configuration. This would look something like a fully generic "Source" transform that can accept a runtime configuration for any supported input type (PubSub, BigQuery, TextIO, etc.).
   c. Supporting cross-language pipelines likely looks just like what is described in 'a', where the configured source is whatever type of in-between collection representation Beam decides to use for passing data between the two runtimes. We should use the changes needed to support these cross-language pipelines to move us toward simpler, cleaner I/O configuration for all pipelines.

2. Cross-pipeline monitoring
   a. When users are monitoring their pipelines, they often need to monitor their data sources as well, for quota, growing backlog, or other issues. Right now, digging into the pipeline representation to find information about data sources is quite tricky. It would be great if the Beam Pipeline proto could make external data sources a first-class citizen so that they could be easily extracted by monitoring systems. Presumably, the representation presented in the proto could be the same one used for configuration. The data source description should make it clear in which transform the source is accessed. Additionally, we should avoid introducing a second copy of this data for this purpose; for the sake of correctness and consistency, the operation of the pipeline and the consumption of this config for monitoring should access the same description.
   b. In addition to a clear description of the data sources in the pipeline, it would be great for the Beam runtime to emit details about the data source, as additional monitoring data, when it is actually accessed. Since the exact data source may not be available in the description and may only be determined at runtime, Beam should export these details via monitoring data. Additionally, Beam should emit monitoring data to confirm access to the data sources at runtime even if the description fully describes the source.

On Tue, Aug 14, 2018 at 10:43 AM Andrea Foegler <foeg...@google.com> wrote:
> Hi folks -
>
> Many of you don't know me, as I don't contribute directly to Beam. But I
> do a lot of work around the periphery, in particular considering how to
> manage and monitor Beam pipelines.
>
> I think there's room in Beam to greatly improve both the management and
> monitoring story, especially around external resources. By far the most
> common external resources in a pipeline are the data sources and sinks.
> Nothing mentioned here is limited to those, and should be considered
> equally valuable for any sort of RPC or other external connection made in a
> pipeline. But I will focus on I/O here to provide some focus.
>
> The two questions I'd like Beamers to think about are:
> 1. How could I easily monitor a Beam pipeline AND all of its external
> dependencies in a single monitoring experience? How could I easily
> distinguish the external dependencies of a Beam pipeline?
>
> 2. How could I make a pipeline data source easily configurable so that I
> could launch existing pipelines with a different data source easily? Is it
> possible to do this even if the type of source changes?
> <<Note: A great answer to this question might require rethinking templates a
> bit. More on that later :) >>
>
> I'm attaching a doc with these questions / ideas fleshed out a bit more.
> I would love to hear your thoughts. And if we end up with some consensus,
> I'd love your help in creating a plan to engineer some solutions to these
> ideas.
>
> Thanks!
> Andrea
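
To make the "externally configurable sources" use case from the doc a bit more concrete, here is a minimal Python sketch. It is an illustration only, not a proposal for the actual design. It assumes a per-source description (`SourceSpec`, a hypothetical stand-in for the configuration proto) that both selects the reader at runtime and is handed unchanged to monitoring. None of these names are Beam APIs, and the `inline` and `text` readers are placeholders for real I/O types such as PubSub or BigQuery.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class SourceSpec:
    """Hypothetical stand-in for a per-I/O-type configuration proto."""
    io_type: str             # which kind of source, e.g. "inline" or "text"
    transform: str           # name of the transform that accesses the source
    params: Dict[str, str]   # type-specific settings


# Registry from I/O type to a reader function (all hypothetical).
_READERS: Dict[str, Callable[[Dict[str, str]], List[str]]] = {}


def register(io_type: str):
    """Decorator that registers a reader for one I/O type."""
    def wrap(fn):
        _READERS[io_type] = fn
        return fn
    return wrap


@register("inline")
def _read_inline(params: Dict[str, str]) -> List[str]:
    # Placeholder source: values are embedded in the configuration itself.
    return params["values"].split(",")


@register("text")
def _read_text(params: Dict[str, str]) -> List[str]:
    # Placeholder source: one element per line of a local file.
    with open(params["path"]) as f:
        return [line.rstrip("\n") for line in f]


def read_source(spec: SourceSpec) -> List[str]:
    """Generic 'Source': picks the reader at runtime from the spec."""
    try:
        reader = _READERS[spec.io_type]
    except KeyError:
        raise ValueError(f"unsupported source type: {spec.io_type}")
    return reader(spec.params)


def describe_sources(specs: List[SourceSpec]) -> List[dict]:
    """What a monitoring system would extract: the same specs, as dicts."""
    return [asdict(s) for s in specs]


if __name__ == "__main__":
    spec = SourceSpec("inline", "ReadInput", {"values": "a,b,c"})
    print(read_source(spec))
    print(describe_sources([spec]))
```

Because `describe_sources` serializes the same `SourceSpec` objects that `read_source` consumes, there is no second copy of the configuration to drift out of sync, which is the consistency property the monitoring use case asks for.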