Re: Deterministic field ordering in derived schemas

Kenneth Knowles Wed, 05 Feb 2020 20:49:40 -0800

Are we in danger of reinventing protobuf's practice of giving fields
numbers? (The practice itself was almost certainly in use decades before
protobuf's creation.) Could we just use the same practice?


Schema fields already have integer IDs and "encoding_position" (see
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
Are these the same as proto field numbers? Do we need both? What is the
expectation around how they interact? The proto needs
comments/documentation!

This does not directly address the question, but any solution related to
how auto-generated schemas work should be specified in terms of the proto.
For example, annotations to suggest one or both of these fields. Or,
lacking that, sorting by name (giving up on "new fields come last"
behavior). Or warning that the schema is unstable. Etc.
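
To make the annotation idea concrete, here is a minimal sketch, not Beam's
actual API. The @FieldNumber annotation, the FieldOrderingSketch class, and
the User POJO are all hypothetical names invented for illustration. Fields
carrying an explicit number come first in numeric order; unannotated fields
fall back to sorting by name, which gives up "new fields come last" for
them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical annotation (not part of Beam): pins a field to an explicit
// position, in the spirit of protobuf field numbers.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface FieldNumber {
  int value();
}

class FieldOrderingSketch {
  // Example POJO whose declaration order differs from the desired order.
  static class User {
    @FieldNumber(2) String name;
    @FieldNumber(1) long id;
    String email;   // unannotated: ordered by name after numbered fields
    String address; // unannotated: ordered by name after numbered fields
  }

  // Returns field names in an order that does not depend on the order in
  // which reflection happens to return them.
  static List<String> orderedFieldNames(Class<?> clazz) {
    List<Field> fields = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      // Skip compiler-generated and static members.
      if (!f.isSynthetic() && !Modifier.isStatic(f.getModifiers())) {
        fields.add(f);
      }
    }
    fields.sort(
        Comparator.comparingInt(
                (Field f) -> {
                  FieldNumber n = f.getAnnotation(FieldNumber.class);
                  // Unannotated fields sort after every numbered field.
                  return n == null ? Integer.MAX_VALUE : n.value();
                })
            .thenComparing(Field::getName));
    List<String> names = new ArrayList<>();
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(orderedFieldNames(User.class));
  }
}
```

In this sketch a new field stays last only if the author gives it the next
unused number; how such numbers would map onto the IDs and
"encoding_position" in schema.proto is exactly the open question above.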

Kenn

On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:

> The Java compiler doesn't know about whether a field was added or removed
> when compiling source to class so there is no way for it to provide an
> ordering that puts "new" fields at the end and the source specification
> doesn't allow for users to state the field ordering that should be used.
> You can ask users to annotate a field ordering[1] using custom annotations
> but a general solution will require some type of sorting.
>
> 1: https://stackoverflow.com/a/1099389/4368200
>
> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>
>> I have yet to figure out a way to make Schema inference deterministically
>> ordered, because Java reflection provides no guaranteed ordering (I suspect
>> that the JVM returns functions by iterating over a hash map, or something
>> of that form). Ideas such as "sort all the fields" actually make things
>> worse, because new fields will end up in the middle of the field list.
>>
>> This is a problem for runners that support an "update" functionality.
>> Currently the solution I was working on was to allow the runner to inspect
>> the previous graph on an update, to ensure that we maintain the previous
>> order.
>>
>> If you know a way to ensure deterministic ordering, I would love to know.
>> I even went so far as to try and open the .class file to get members in the
>> order defined there, but that is very complex, error prone, and I believe
>> still doesn't guarantee order stability.
>>
>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> +1 to standardizing on a deterministic ordering for inference if none is
>>> imposed by the structure.
>>>
>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>
>>>> There are Beam schema providers that use Java reflection to get fields
>>>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>>>> with "creators", because function arguments are ordered. We cache instances
>>>> of schema coders, but there is no guarantee that it's deterministic between
>>>> JVMs. As a result, I've seen cases when the construction of pipeline graphs
>>>> and output schema is non-deterministic. It's especially relevant when
>>>> writing data to external storage, where row schema becomes a table schema.
>>>> There is a workaround to apply a transform that would make schema
>>>> deterministic, for instance, by ordering fields by name.
>>>>
>>>> I would see a benefit in making schemas deterministic by default or at
>>>> least introducing a way to do so without writing custom code. What are your
>>>> thoughts?
>>>>
>>>
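
The update-time approach mentioned in the thread (inspect the previous graph
and keep its field order) can be sketched as a simple merge. This is an
illustrative assumption, not the runner's implementation: UpdateOrderSketch
and mergeOrder are hypothetical names, and fields are reduced to plain
strings. Surviving fields keep their previous positions, removed fields are
dropped, and newly added fields are appended at the end, sorted by name so
that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class UpdateOrderSketch {
  // previousOrder: field names in the order used by the running pipeline.
  // currentFields: field names discovered by (unordered) reflection now.
  static List<String> mergeOrder(List<String> previousOrder, Set<String> currentFields) {
    List<String> merged = new ArrayList<>();
    // Keep surviving fields in their previous positions.
    for (String name : previousOrder) {
      if (currentFields.contains(name)) {
        merged.add(name);
      }
    }
    // Append fields that are new in this version, sorted by name.
    Set<String> added = new TreeSet<>(currentFields);
    added.removeAll(previousOrder);
    merged.addAll(added);
    return merged;
  }

  public static void main(String[] args) {
    List<String> previous = List.of("id", "name");
    Set<String> current = Set.of("name", "email", "id", "address");
    System.out.println(mergeOrder(previous, current));
  }
}
```

The merge only helps when a previous order is available; on a first
submission there is nothing to merge against, so some fallback ordering is
still needed.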
