Adding a StepMetadataRegistry for Python SDK

Pablo Estrada Wed, 28 Mar 2018 17:13:49 -0700

Hello all,
I've filed https://issues.apache.org/jira/browse/BEAM-3955 to consider
adding some sort of facility to translate between the different step names
used by the runners.
This is currently a problem in Dataflow, where steps can have different
names in the backend and in the SDK.
This is observable in Beam code, where different parts of the
SDK/worker/runners use different names in their metrics:


- Logging uses Beam transform names (e.g. Foo/Bar)
- Metrics uses operation_name (e.g. s2)
- Statesampler uses operation_name.
- The Dataflow worker sets step_name to operation_name after creating the
operation.

I'd like to propose the following design outline:

   - Create an execution context that will allow runners to provide their
   specific functionality.
   - The execution context will be able to provide multiple pieces of
   runner-specific functionality (e.g. side input fetchers).
   - In this case, the execution contexts can have a StepNameRegistry, or
   StepRegistry, or StepMetadataRegistry of some kind, where step names and
   other metadata can be enrolled.
   - Runners can pass their execution contexts to operations, logging, and
   other modules.
   - Beam core can then switch to using Beam step names, and each runner's
   specific monitoring / metrics / etc. classes can have their own logic for
   accessing these.
   - This would also allow us to remove the LoggingContext tracking, and
   rely only on statesampler for context tracking.
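
To make the outline above a bit more concrete, here is a minimal sketch of
how such a registry might hang off an execution context. Everything in it is
hypothetical: the names StepMetadata, StepMetadataRegistry, ExecutionContext,
register, runner_name_for and beam_name_for are illustrative assumptions and
not existing Beam classes or APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StepMetadata:
    """Hypothetical record tying a Beam step name to a runner-side name."""
    beam_name: str    # user-facing transform name, e.g. "Foo/Bar"
    runner_name: str  # runner-internal name, e.g. "s2"


class StepMetadataRegistry:
    """Hypothetical registry where step names and metadata are enrolled."""

    def __init__(self) -> None:
        self._by_beam_name: Dict[str, StepMetadata] = {}
        self._by_runner_name: Dict[str, StepMetadata] = {}

    def register(self, beam_name: str, runner_name: str) -> StepMetadata:
        # Index the same record under both names so lookups work either way.
        metadata = StepMetadata(beam_name, runner_name)
        self._by_beam_name[beam_name] = metadata
        self._by_runner_name[runner_name] = metadata
        return metadata

    def runner_name_for(self, beam_name: str) -> Optional[str]:
        metadata = self._by_beam_name.get(beam_name)
        return metadata.runner_name if metadata else None

    def beam_name_for(self, runner_name: str) -> Optional[str]:
        metadata = self._by_runner_name.get(runner_name)
        return metadata.beam_name if metadata else None


class ExecutionContext:
    """Hypothetical container for runner-specific functionality."""

    def __init__(
            self, step_registry: Optional[StepMetadataRegistry] = None) -> None:
        if step_registry is None:
            step_registry = StepMetadataRegistry()
        self.step_registry = step_registry


# A runner enrolls its steps once, then passes the context to operations,
# logging, and metrics code, which translate names as needed.
context = ExecutionContext()
context.step_registry.register(beam_name="Foo/Bar", runner_name="s2")
print(context.step_registry.runner_name_for("Foo/Bar"))  # prints s2
print(context.step_registry.beam_name_for("s2"))         # prints Foo/Bar
```

In this sketch Beam core would only ever see "Foo/Bar", and a runner-specific
metrics class would call runner_name_for when it needs the backend name.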

Eventually, all of this should be fully contained in the portability API
and runners won't have to deal with these issues, but for now it seems like
a good compromise.

If this sounds good, I'll start working to implement that.
Note that this is only a rough description, and I'm open to reconsidering
any and all aspects.

Best
-P.
-- 
Got feedback? go/pabloem-feedback
