Let's understand the use case first. My concern was making SchemaCoder compatible across different invocations of a pipeline, which is why I introduced encoding_position. It allows the field id to change while the encoding_position is preserved. However, this is internal to a pipeline.
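For concreteness, a minimal sketch of that idea, assuming the Java Schema class exposes the proto's encoding_position via setEncodingPositions (the exact API surface may differ): pinning encoding positions by field name keeps the byte-level field order stable even when inference assigns different field indexes on each run.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.schemas.Schema;

    public class EncodingPositions {
      public static void main(String[] args) {
        // Two invocations of the same pipeline may infer the same fields in
        // a different order, so the field indexes differ between runs.
        Schema run1 = Schema.builder()
            .addStringField("userId")
            .addInt64Field("timestamp")
            .build();
        Schema run2 = Schema.builder()
            .addInt64Field("timestamp") // inferred first this time
            .addStringField("userId")
            .build();

        // Pinning encoding positions by field name keeps the coder's byte
        // order stable across runs even though the field indexes changed.
        Map<String, Integer> positions = new HashMap<>();
        positions.put("userId", 0);
        positions.put("timestamp", 1);
        run1.setEncodingPositions(positions);
        run2.setEncodingPositions(positions);
      }
    }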
If the worry is writing rows to a sink, how are the rows being written? I would highly advise against using Beam's internal binary representation to write rows external to a pipeline. That representation is meant to be an internal detail of schemas, not a public binary format. Rows should be converted to some public format before being written. I wonder if a convenience method on Row - getValuesOrderedByName() - would be sufficient for this use case? (A sketch of such a helper appears after the quoted thread below.)

Reuven

On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles <k...@apache.org> wrote:

> Are we in danger of reinventing protobuf's practice of giving fields numbers? (This practice itself was almost certainly in use decades before protobuf's creation.) Could we just use the same practice?
>
> Schema fields already have integer IDs and "encoding_position" (see https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto). Are these the same as proto field numbers? Do we need both? What is the expectation around how they interact? The proto needs comments/documentation!
>
> This does not directly address the question, but any solution related to how auto-generated schemas work should be specified in terms of the proto. For example, annotations to suggest one or both of these fields. Or, lacking that, sorting by name (giving up on "new fields come last" behavior). Or warning that the schema is unstable. Etc.
>
> Kenn
>
> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:
>
>> The Java compiler doesn't know whether a field was added or removed when compiling source to class, so there is no way for it to provide an ordering that puts "new" fields at the end, and the source specification doesn't allow users to state the field ordering that should be used. You can ask users to annotate a field ordering [1] using custom annotations, but a general solution will require some type of sorting.
>>
>> 1: https://stackoverflow.com/a/1099389/4368200
>>
>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>>
>>> I have yet to figure out a way to make Schema inference deterministically ordered, because Java reflection provides no guaranteed ordering (I suspect that the JVM returns functions by iterating over a hash map, or something of that form). Ideas such as "sort all the fields" actually make things worse, because new fields will end up in the middle of the field list.
>>>
>>> This is a problem for runners that support an "update" functionality. Currently the solution I was working on was to allow the runner to inspect the previous graph on an update, to ensure that we maintain the previous order.
>>>
>>> If you know a way to ensure deterministic ordering, I would love to know. I even went so far as to try to open the .class file to get members in the order defined there, but that is very complex, error prone, and I believe still doesn't guarantee order stability.
>>>
>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> +1 to standardizing on a deterministic ordering for inference if none is imposed by the structure.
>>>>
>>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>
>>>>> There are Beam schema providers that use Java reflection to get fields for plain classes with fields and for AutoValue classes. It isn't relevant for POJOs with "creators", because function arguments are ordered.
>>>>> We cache instances of schema coders, but there is no guarantee that the inferred field order is deterministic between JVMs. As a result, I've seen cases where the construction of pipeline graphs and output schemas is non-deterministic. It's especially relevant when writing data to external storage, where the row schema becomes a table schema. There is a workaround: apply a transform that makes the schema deterministic, for instance by ordering fields by name.
>>>>>
>>>>> I would see a benefit in making schemas deterministic by default, or at least in introducing a way to do so without writing custom code. What are your thoughts?
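To make the Row suggestion above concrete: getValuesOrderedByName() does not exist on Row today, so here is a sketch of the proposed helper written against the existing Row and Schema APIs (the method name and placement come from the proposal, not from an existing API):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    public class RowValues {
      /** Returns the row's values ordered by field name instead of by field index. */
      public static List<Object> getValuesOrderedByName(Row row) {
        return row.getSchema().getFields().stream()
            .map(Schema.Field::getName)
            .sorted()
            .map(name -> (Object) row.getValue(name))
            .collect(Collectors.toList());
      }
    }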
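Luke's annotation idea [1] could look roughly like the following. This is a hedged sketch: FieldNumber is a hypothetical annotation, not something Beam ships; it only shows how explicit, user-assigned numbers (as in proto) make the reflected field order independent of JVM iteration order, with new fields sorting last by taking the next unused number.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.lang.reflect.Field;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class AnnotatedFieldOrder {

      /** Hypothetical annotation letting users pin a stable field number. */
      @Retention(RetentionPolicy.RUNTIME)
      @Target(ElementType.FIELD)
      public @interface FieldNumber {
        int value();
      }

      /** Example POJO: a newly added field takes the next unused number. */
      public static class Event {
        @FieldNumber(0) public String userId;
        @FieldNumber(1) public long timestamp;
        @FieldNumber(2) public String newlyAddedField;
      }

      /** Orders reflected fields by their annotated number, not JVM order. */
      public static List<Field> orderedFields(Class<?> clazz) {
        return Arrays.stream(clazz.getDeclaredFields())
            .filter(f -> f.isAnnotationPresent(FieldNumber.class))
            .sorted(Comparator.comparingInt(
                (Field f) -> f.getAnnotation(FieldNumber.class).value()))
            .collect(Collectors.toList());
      }
    }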
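And Gleb's workaround of ordering fields by name could be a small utility along these lines (a sketch assuming Schema.Builder.addFields and Row.withSchema behave as in the current Java SDK; note Reuven's caveat above that name-sorting puts newly added fields in the middle of the list, so this buys determinism, not "new fields come last"):

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    public class SortSchema {
      /** Returns an equivalent schema with fields sorted by name. */
      public static Schema sortedByName(Schema schema) {
        List<Schema.Field> fields = schema.getFields().stream()
            .sorted(Comparator.comparing(Schema.Field::getName))
            .collect(Collectors.toList());
        return Schema.builder().addFields(fields).build();
      }

      /** Re-maps a row onto the name-sorted version of its schema. */
      public static Row sortedByName(Row row) {
        Schema sorted = sortedByName(row.getSchema());
        Row.Builder builder = Row.withSchema(sorted);
        for (Schema.Field field : sorted.getFields()) {
          builder.addValue(row.getValue(field.getName()));
        }
        return builder.build();
      }
    }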