Input/output PCollection types at least have to be portable Beam types [1]
for cross-language to work.

I think we restricted schema-aware transforms to PCollection<Row> since Row
was expected to be an efficient replacement for arbitrary portable Beam
types (not sure how true that is in practice currently).

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829

On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com> wrote:

> Is it actually necessary for a PTransform that is configured via the
> Schema mechanism to also be one that uses RowCoder? Those strike me as two
> separate concerns and unnecessarily limiting.
>
> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>> +1 for the simplification.
>>
>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Yeah. Essentially one needs do (1) name the arguments and (2) implement
>>> the transform. Hopefully (1) could be done in a concise way that allows for
>>> easy consumption from both Java and cross-language.
>>>
>>
>> +1 but I think the hard part today is to convert existing PTransforms to
>> be schema-aware transform compatible (for example, change input/output
>> types and make sure parameters take Beam Schema compatible types). But this
>> makes sense for new transforms.
>>
>>
>>
>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Or perhaps the other way around? If you have a Schema we can
>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>> DRY.
>>>>
>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> +1 to this simplification, it's a historical artifact that provides no
>>>>> value.
>>>>>
>>>>> I would love it if we also looked into ways to auto-generate the
>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>> itself.
>>>>>
>>>>>
>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>> - SchemaTransformProvider [1]
>>>>>> - A configuration object
>>>>>> - SchemaTransform [2]
>>>>>>
>>>>>> The API is generally used like this:
>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>> returns a SchemaTransform
>>>>>> 2. The SchemaTransform is used to build a PTransform according to the
>>>>>> configuration
>>>>>>
>>>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>>>> combine the two steps if we have SchemaTransformProvider return the
>>>>>> PTransform directly.
>>>>>>
>>>>>> We can then remove the SchemaTransform class as it will be obsolete.
>>>>>> This should be safe to do; the only place it's used in our API is here 
>>>>>> [3],
>>>>>> and that can be simplified if we make this change (we'd just trim `
>>>>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>>>>> directly return the PTransform).
>>>>>>
>>>>>> I'd like to first mention that I was not involved in the design
>>>>>> process of this API so I may be missing some information on why it was 
>>>>>> set
>>>>>> up this way.
>>>>>>
>>>>>> A few developers already raised questions about how there's seemingly
>>>>>> unnecessary boilerplate involved in making a Java transform portable. I
>>>>>> wasn't involved in the design process of this API so I may be missing 
>>>>>> some
>>>>>> information, but my assumption is this was designed to follow the pattern
>>>>>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>>>>> SchemaIO[5] -> PTransform. However, with the newer
>>>>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>>>>> SchemaTransform class to have a generic buildTransform() method. See the
>>>>>> example of PubsubReadSchemaTransformProvider [6], where the
>>>>>> SchemaTransform interface and buildTransform method are implemented
>>>>>> just to satisfy the requirement that SchemaTransformProvider::from
>>>>>> return a SchemaTransform.
>>>>>>
>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>> less convoluted to develop portable transforms.
>>>>>>
>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>> applying these changes to them should be straightforward. If people think
>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>
>>>>>> Best,
>>>>>> Ahmed
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>> [3]
>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>> [4]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>> [5]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>> [6]
>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>
>>>>>

Reply via email to