Hi folks -

Many of you don't know me, as I don't contribute directly to Beam.  But I
do a lot of work around the periphery, in particular considering how to
manage and monitor Beam pipelines.

I think there's room in Beam to greatly improve both the management and
monitoring story, especially around external resources.  By far the most
common external resources in a pipeline are the data sources and sinks.
Nothing mentioned here is limited to those; these ideas should be
considered equally valuable for any sort of RPC or other external
connection made in a pipeline.  But I will focus on I/O here to keep the
discussion concrete.

The two questions I'd like Beamers to think about are:
1.  How could I easily monitor a Beam pipeline AND all of its external
dependencies in a single monitoring experience?  How could I easily
distinguish the external dependencies of a Beam pipeline?

2.  How could I make a pipeline data source easily configurable so that I
could launch existing pipelines with a different data source easily?  Is it
possible to do this even if the type of source changes?  <<Note: A great
answer to this question might require rethinking templates a bit.  More on
that later :) >>

I'm attaching a doc with these questions / ideas fleshed out a bit more.  I
would love to hear your thoughts.  And if we end up with some consensus,
I'd love your help in creating a plan to engineer some solutions to these
ideas.

Thanks!
Andrea

Beam I/O representation

In the current Beam programming model, sources and sinks are virtually
indistinguishable from other transforms.  From a composition point of view,
this is great.  But from an integration point of view, special handling of
external data sources would be very useful.  I don't intend to propose any
particular solution, but I will outline two use cases that would be great to
support in Beam.  I will also note that generalizing these ideas to cover all
external dependencies, and not just data sources, seems like a good idea.

1.  Externally configurable sources

  a.  Each I/O type should be describable using a configuration proto
      specific to that I/O type.  This would make configuring sources work
      the same way across SDK languages.  It would also make it obvious what
      parameterization is supported and how to use it, for example selecting
      a subset of BigQuery partitions.

  b.  Tools that help a user construct a pipeline will likely already have
      some representation of the user's available data sources, something
      like a Data Lake or Data Hub.  The easiest initial integration with
      these tools would be canned pipelines that can be configured with any
      type of data source.  To support this, we would need pipelines that
      could be configured at runtime with both the type of source and its
      configuration.  This would look something like a fully generic
      "Source" transform that accepts a runtime configuration for any
      supported input type (PubSub, BigQuery, TextIO, etc.).  A rough sketch
      of what this could look like follows this list.

  c.  Supporting cross-language pipelines likely looks just like what is
      described in 'a', where the configured source is whatever in-between
      collection representation Beam decides to use for passing data between
      the two runtimes.  We should use the changes needed to support
      cross-language pipelines to move us toward simpler, cleaner I/O
      configuration for all pipelines.
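
To make 'a' and 'b' a bit more concrete, here is a rough sketch in the
Python SDK.  None of this is an existing Beam API: the ConfigurableSource
transform, the source-type names, and the configuration dict (a stand-in for
the per-I/O configuration proto) are hypothetical, and only meant to show the
shape of a generic, runtime-configured source.

    # Hypothetical sketch, not an existing Beam API.
    import apache_beam as beam

    class ConfigurableSource(beam.PTransform):
        """Expands to a concrete read chosen by configuration supplied at launch."""

        def __init__(self, source_type, config):
            super().__init__()
            self.source_type = source_type  # e.g. "textio" or "pubsub"
            self.config = config            # per-type parameters (stand-in for a proto)

        def expand(self, pbegin):
            if self.source_type == "textio":
                return pbegin | beam.io.ReadFromText(self.config["file_pattern"])
            if self.source_type == "pubsub":
                return pbegin | beam.io.ReadFromPubSub(
                    subscription=self.config["subscription"])
            raise ValueError("Unsupported source type: %s" % self.source_type)

    # The same pipeline definition can then be launched against a different
    # source by changing only the configuration:
    with beam.Pipeline() as p:
        lines = p | ConfigurableSource(
            "textio", {"file_pattern": "gs://my-bucket/input*.txt"})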

2.  Cross-pipeline monitoring

  a.  When users are monitoring their pipelines, they often need to monitor
      their data sources as well, for quota, growing backlog or other
      issues.  Right now, digging into the pipeline representation to find
      information about data sources is quite tricky.  It would be great if
      the Beam pipeline proto could make external data sources first class
      citizens so that they could be easily extracted by monitoring systems.
      Presumably, the representations in the proto could be the same ones
      used for configuration.  The data source description should make it
      clear in which transform it is accessed.  Additionally, we should
      avoid introducing a second copy of this data for this purpose; for
      correctness and consistency's sake, the operation of the pipeline and
      the consumption of this config for monitoring should use the same
      description.

  b.  In addition to a clear description of the data sources in the
      pipeline, it would be great for the Beam runtime to emit details about
      the data source when it is actually accessed, as additional monitoring
      data.  Since the exact data source may not be available in the
      description and may only be determined at runtime, Beam should export
      these details via monitoring data.  Additionally, Beam should emit
      monitoring data confirming access to the data sources at runtime even
      when the description fully describes the source.  A rough sketch of
      how this could be approximated today follows this list.
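
To make 'b' a bit more concrete, here is a rough sketch using the existing
Metrics API in the Python SDK.  The 'io_monitoring' namespace and the idea of
encoding the resolved source in the metric name are assumptions made only for
illustration; a plain counter is clearly a poor substitute for the
first-class, structured signal proposed above.

    # Sketch using Beam's existing Metrics API; the naming scheme is hypothetical.
    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class RecordSourceAccess(beam.DoFn):
        """Counts elements read from a concrete source, so a monitoring system
        can see that the source was actually accessed, and which one it was."""

        def __init__(self, source_description):
            # source_description is the resolved, runtime description of the
            # source, e.g. "textio:gs://my-bucket/input*.txt".
            self.accessed = Metrics.counter('io_monitoring', source_description)

        def process(self, element):
            self.accessed.inc()
            yield element

    # Usage: attach it right after the read, once the concrete source is known.
    with beam.Pipeline() as p:
        lines = (p
                 | beam.io.ReadFromText('gs://my-bucket/input*.txt')
                 | beam.ParDo(RecordSourceAccess('textio:gs://my-bucket/input*.txt')))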