I have created a new PR (https://github.com/apache/beam/pull/16998) because
the previous one closed for some reason and I cannot re-open it (apologies
about that). Our use-case for this is to import some Twitter specific
plugins on the worker. Please let me know if there is anything else needed
on this.

Thanks & Regards,
Rahul Iyer


On Tue, Mar 1, 2022 at 4:25 PM Robert Bradshaw <[email protected]> wrote:

> On Tue, Mar 1, 2022 at 12:36 PM Valentyn Tymofieiev <[email protected]>
> wrote:
>
>> +1 as well.
>>
>> One Beam user mentioned a while ago they needed to run smth like `import
>> tensorflow_addons; tensorflow_addons.register.register_all()` as a very
>> first operation once the Python process starts. Looks like importing a
>> plugin as a module would allow exactly that.
>>
>> Does anyone know why FileSystems use the Plugin interface? Does this mean
>> that some aspects of the FS functionality didn't work for portable
>> pipelines all this time but we didn't notice it?
>>
>
> It would be broken if someone tried to use a FileSystem that was not
> imported by default.
>
>
>>
>>
>>
>> On Tue, Mar 1, 2022 at 10:16 AM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> +1 to this change which looks like an omission in runner v2.
>>>
>>> The intent of beam plugins is to enforce that modules imported during
>>> pipeline construction will also get imported on the workers (e.g. if
>>> certain setup or registration needs to be performed). (Python doesn't have
>>> service registries like Java.) The other piece of this is populating the
>>> pipeline option with BeamPlugin.get_all_plugin_paths() which looks like
>>> it's only done on the Dataflow runner:
>>> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L511
>>>
>>> I have no idea why SdkContainerImageBuilder extends beam.Plugin though...
>>>
>>>
>>> On Tue, Mar 1, 2022 at 8:51 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>> Hi Rahul,
>>>>
>>>> Your change looks reasonable to me. And it brings dataflow python
>>>> runners v1 and v2 closer.
>>>>
>>>> Related to removing the experimental, I have some questions: What is
>>>> the use case for beam plugins? I do not see them being used often. IIRC the
>>>> main use was importing modules before starting the worker. There is also a
>>>> beam plugin clearly intended for runner v2 (custom containers) [1], how
>>>> does that work on V2?
>>>>
>>>> Ahmet
>>>>
>>>> /cc @Yichi Zhang <[email protected]> @Valentyn Tymofieiev
>>>> <[email protected]> @Ryan Thompson <[email protected]>
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/1636a3ace2c039a433cf45c51e4cee1151054730/sdks/python/apache_beam/runners/portability/sdk_container_builder.py#L68
>>>>
>>>> On Thu, Feb 24, 2022 at 10:14 AM Rahul Iyer <[email protected]> wrote:
>>>>
>>>>> Good Morning/Afternoon/Evening folks,
>>>>>
>>>>> The current support for beam-plugins is experimental and we would like
>>>>> to have it as a first class member of the beam library for Python Runner
>>>>> v2. This helps us load plugins into the runtime before starting the
>>>>> SdkHarness. https://github.com/apache/beam/pull/16920 is a PR I
>>>>> created for this purpose. Wanted to gather some thoughts around the
>>>>> approach here and have it standardized. The current implementation of beam
>>>>> plugins allows users to extend a class from BeamPlugin and it gets
>>>>> automatically populated in the --beam_plugin PipelineOption, e.g.:
>>>>> FileSystem
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py#L475>.
>>>>> This creates the pipeline option as,
>>>>>
>>>>> --beam_plugin=[
>>>>>
>>>>>   'apache_beam.io.aws.s3filesystem.S3FileSystem',
>>>>>
>>>>>   'apache_beam.io.filesystem.FileSystem',
>>>>>
>>>>>   'apache_beam.io.hadoopfilesystem.HadoopFileSystem',
>>>>>
>>>>>   'apache_beam.io.localfilesystem.LocalFileSystem',
>>>>>
>>>>>   'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem',
>>>>>
>>>>>   'apache_beam.io.azure.blobstoragefilesystem.BlobStorageFileSystem'
>>>>>
>>>>> ]
>>>>>
>>>>> Another way is to provide a module via the --beam_plugin
>>>>> PipelineOption, e.g.:
>>>>>
>>>>> --beam_plugin='twitter.beam.rule_the_world'
>>>>>
>>>>> The current implementation in the PR supports both these approaches
>>>>> but would love to have a standardized way forward and have it documented.
>>>>> Would love to hear your thoughts about this.
>>>>>
>>>>> Thanks & Regards,
>>>>> Rahul Iyer
>>>>>
>>>>

Reply via email to