I have created a new PR (https://github.com/apache/beam/pull/16998) because the previous one closed for some reason and I cannot re-open it (apologies about that). Our use-case for this is to import some Twitter specific plugins on the worker. Please let me know if there is anything else needed on this.
Thanks & Regards, Rahul Iyer On Tue, Mar 1, 2022 at 4:25 PM Robert Bradshaw <[email protected]> wrote: > On Tue, Mar 1, 2022 at 12:36 PM Valentyn Tymofieiev <[email protected]> > wrote: > >> +1 as well. >> >> One Beam user mentioned a while ago they needed to run smth like `import >> tensorflow_addons; tensorflow_addons.register.register_all()` as a very >> first operation once the Python process starts. Looks like importing a >> plugin as a module would allow exactly that. >> >> Does anyone know why FileSystems use the Plugin interface? Does this mean >> that some aspects of the FS functionality didn't work for portable >> pipelines all this time but we didn't notice it? >> > > It would be broken if someone tried to use a FileSystem that was not > imported by default. > > >> >> >> >> On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <[email protected]> >> wrote: >> >>> +1 to this change which looks like an omission in runner v2. >>> >>> The intent of beam plugins is to enforce that modules imported during >>> pipeline construction will also get imported on the workers (e.g. if >>> certain setup or registration needs to be performed). (Python doesn't have >>> service registries like Java.) The other piece of this is populating the >>> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like >>> it's only done on the Dataflow runner: >>> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511 >>> >>> I have no idea why SdkContainerImageBuilder extends beam.Plugin though... >>> >>> >>> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <[email protected]> wrote: >>> >>>> Hi Rahul, >>>> >>>> Your change looks reasonable to me. And it brings dataflow python >>>> runners v1 and v2 closer. >>>> >>>> Related to removing the experimental, I have some questions: What is >>>> the use case for beam plugins? I do not see them being used often. IIRC the >>>> main use was importing modules before starting the worker. There is also a >>>> beam plugin clearly intended for runner v2 (custom containers) [1], how >>>> does that work on V2? >>>> >>>> Ahmet >>>> >>>> /cc @Yichi Zhang <[email protected]> @Valentyn Tymofieiev >>>> <[email protected]> @Ryan Thompson <[email protected]> >>>> >>>> [1] >>>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68 >>>> >>>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <[email protected]> wrote: >>>> >>>>> Good Morning/Afternoon/Evening folks, >>>>> >>>>> The current support for beam-plugins is experimental and we would like >>>>> to have it as a first class member of the beam library for Python Runner >>>>> v2. This helps us load plugins into the runtime before starting the >>>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I >>>>> created for this purpose. Wanted to gather some thoughts around the >>>>> approach here and have it standardized. The current implementation of beam >>>>> plugins allows users to extend a class from BeamPlugin and it gets >>>>> automatically populated in the --beam_plugin PipelineOption, e.g.: >>>>> FileSystem >>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>. >>>>> This creates the pipeline option as, >>>>> >>>>> --beam_plugin=[ >>>>> >>>>> 'apache_beam.io.aws.s3filesystem.S3FileSystem', >>>>> >>>>> 'apache_beam.io.filesystem.FileSystem', >>>>> >>>>> 'apache_beam.io.hadoopfilesystem.HadoopFileSystem', >>>>> >>>>> 'apache_beam.io.localfilesystem.LocalFileSystem', >>>>> >>>>> 'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem', >>>>> >>>>> 'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem' >>>>> >>>>> ] >>>>> >>>>> Another way is to provide a module via the --beam_plugin >>>>> PipelineOption, e.g.: >>>>> >>>>> --beam_plugin='twitter.beam.rule_the_world' >>>>> >>>>> The current implementation in the PR supports both these approaches >>>>> but would love to have a standardized way forward and have it documented. >>>>> Would love to hear your thoughts about this. >>>>> >>>>> Thanks & Regards, >>>>> Rahul Iyer >>>>> >>>>
