I can help review the Java change.
-Rui On Thu, Aug 6, 2020 at 9:53 AM Brian Hulette <[email protected]> wrote: > The PR for this is up now: https://github.com/apache/beam/pull/12481 > Any volunteers to help review? We may want a separate reviewer for Python > and Java changes. > > Brian > > On Wed, Aug 5, 2020 at 9:00 AM Brian Hulette <[email protected]> wrote: > >> What I'm working on changes ExternalConfigurationPayload [1] to this: >> >> message ExternalConfigurationPayload { >> // A schema for use in beam:coder:row:v1 >> Schema schema = 1; >> >> // A payload which can be decoded using beam:coder:row:v1 and the given >> schema. >> bytes payload = 2; >> } >> >> The calling SDK can infer a schema from a user configuration type (in >> Python we can make minor changes to SchemaBasedPayloadBuilder for this), >> and use it's implementation of beam:coder:row:v1 to encode an instance of >> that type to the payload. >> >> Similarly the expanding SDK can infer a schema from a user configuration >> type and map the encoded row to an instance of the user type, assuming the >> schemas are compatible. >> >> Brian >> >> [1] >> https://github.com/apache/beam/blob/86b8326b4ebc4e217585847102743cc1d1af367a/model/pipeline/src/main/proto/external_transforms.proto#L42 >> >> On Wed, Aug 5, 2020 at 2:04 AM Maximilian Michels <[email protected]> wrote: >> >>> +1 >>> >>> The format to store coders is not set in stone, it was a first version >>> to make external configuration work. Using the Coder message would be >>> better. >>> >>> As for using Schema to store the configuration, could somebody fill me >>> in how that would work? >>> >>> -Max >>> >>> On 04.08.20 02:01, Brian Hulette wrote: >>> > I've opened BEAM-10571 [1] for this, and I'm most of the way to an >>> > implementation now. Aiming to have it done before the 2.24.0 cut since >>> > it will be the last release with python 2 support. >>> > >>> > [1] https://issues.apache.org/jira/browse/BEAM-10571 >>> > >>> > On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath < >>> [email protected] >>> > <mailto:[email protected]>> wrote: >>> > >>> > >>> > >>> > On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw < >>> [email protected] >>> > <mailto:[email protected]>> wrote: >>> > >>> > On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette >>> > <[email protected] <mailto:[email protected]>> wrote: >>> > > >>> > > Ah yes I'm +1 for that approach too - it would let us >>> > leverage all the schema-inference already in the Java SDK for >>> > translating configuration objects which would be great. >>> > > Things on the Python side would be trickier as schemas don't >>> > formally support all the types you can use in the >>> PayloadBuilder >>> > implementations [1] yet, just NamedTuple. For now we could just >>> > make the PayloadBuilder implementations generate Rows without >>> > making that translation available for use in PCollections. >>> > >>> > >>> > This will be a good opportunity to add some sort of a minimal >>> Python >>> > type to Beam schema mapping :) >>> > >>> > >>> > Yes, though eventually it might be nice to support all of these >>> > various types as schema'd PCollection elements as well. >>> > >>> > > Do we need to worry about update compatibility for >>> > ExternalConfigurationPayload? >>> > >>> > Technically, each URN defines their payload, and the fact that >>> we've >>> > settled on ExternalConfigurationPayload is a convention. On a >>> > practical note, we haven't declared these protos stable yet. (I >>> > would >>> > like to do so before we drop support for Python 2, as external >>> > transforms are a possible escape hatch and the first strong >>> > motivation >>> > to have external transforms that span Beam versions). >>> > >>> > >>> > +1 >>> > >>> > >>> > > [1] >>> > >>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py >>> > > >>> > > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw >>> > <[email protected] <mailto:[email protected]>> wrote: >>> > >> >>> > >> I would be in favor of just using a schema to store the >>> entire >>> > >> configuration. The reason we went with what we have to day >>> > is that we >>> > >> didn't have cross language schemas yet. >>> > >> >>> > >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette >>> > <[email protected] <mailto:[email protected]>> wrote: >>> > >> > >>> > >> > Hi everyone, >>> > >> > I noticed that currently the ExternalConfigurationPayload >>> > uses a list of coder URNs to represent the coder that was used >>> > to serialize each configuration field [1]. This seems >>> acceptable >>> > at first blush, but there's one notable issue: it has no place >>> > to store a payload for the coder. Most standard coders don't >>> use >>> > a payload so it's not a problem, but row coder does use a >>> > payload to store it's schema, which means it can't be used in >>> an >>> > ExternalConfigurationPayload today. >>> > >> > >>> > >> > Is there a reason not to just use the Coder message [2] >>> in >>> > ExternalConfigurationPayload instead of a list of coder URNs? >>> > That would work with row coder, and it would also make it >>> easier >>> > to re-use logic for translating Pipeline protos. >>> > >> > >>> > >> > I'd be happy to make this change, but I wanted to ask on >>> > dev@ in case there's something I'm missing here. >>> > >> > >>> > >> > Brian >>> > >> > >>> > >> > [1] >>> > >>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35 >>> > >> > [2] >>> > >>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555 >>> > >>> >>
