Hey everyone,

I was looking at how we use SchemaTransforms in our expansion service. From
what I see, there may be a redundant step in developing SchemaTransforms.
Currently, we have 3 pieces:
- SchemaTransformProvider [1]
- A configuration object
- SchemaTransform [2]

The API is generally used like this:
1. The SchemaTransformProvider takes a configuration object and returns a
SchemaTransform
2. The SchemaTransform is used to build a PTransform according to the
configuration

In these steps, the SchemaTransform class seems unnecessary. We can combine
the two steps if we have SchemaTransformProvider return the PTransform
directly.

We can then remove the SchemaTransform class as it will be obsolete. This
should be safe to do; the only place it's used in our API is here [3], and
that can be simplified if we make this change (we'd just trim `
.buildTransform()` off the end as `provider.from(configRow)` will directly
return the PTransform).

I'd like to first mention that I was not involved in the design process of
this API so I may be missing some information on why it was set up this way.

A few developers already raised questions about how there's seemingly
unnecessary boilerplate involved in making a Java transform portable. I
wasn't involved in the design process of this API so I may be missing some
information, but my assumption is this was designed to follow the pattern
of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
SchemaIO[5] -> PTransform. However, with the newer SchemaTransformProvider
API, we dropped a few methods and reduced the SchemaTransform class to have
a generic buildTransform() method. See the example of
PubsubReadSchemaTransformProvider [6], where the SchemaTransform interface
and buildTransform method are implemented just to satisfy the requirement
that SchemaTransformProvider::from return a SchemaTransform.

I'm bringing this up because if we are looking to encourage contribution to
cross-language use cases, we should make it simpler and less convoluted to
develop portable transforms.

There are a number of SchemaTransforms already developed, but applying
these changes to them should be straightforward. If people think this is a
good idea, I can open a PR and implement them.

Best,
Ahmed

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
[3]
https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
[4]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
[5]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
[6]
https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137

Reply via email to