I have yet to figure out a way to make Schema inference deterministically
ordered, because Java reflection provides no guaranteed ordering (I suspect
that the JVM returns methods by iterating over a hash map, or something
of that form). Ideas such as "sort all the fields" actually make things
worse, because new fields will end up in the middle of the field list and
shift the position of every field after them.
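
To make the sorting problem concrete, here is a small sketch in plain Java.
The field names (userId, country, email) and the class name are made up for
illustration; nothing here is Beam API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortedFieldOrderDemo {
  // Mimics a "sort all the fields" strategy: order field names alphabetically.
  static List<String> sortedOrder(List<String> fieldNames) {
    List<String> sorted = new ArrayList<>(fieldNames);
    Collections.sort(sorted);
    return sorted;
  }

  public static void main(String[] args) {
    // Version 1 of a hypothetical class has two fields.
    List<String> v1 = sortedOrder(List.of("userId", "country"));
    // Version 2 adds an "email" field.
    List<String> v2 = sortedOrder(List.of("userId", "country", "email"));

    System.out.println(v1); // [country, userId]
    System.out.println(v2); // [country, email, userId]

    // The existing field "userId" moved from index 1 to index 2.
    System.out.println(v1.indexOf("userId") + " -> " + v2.indexOf("userId"));
  }
}
```

Anything that addresses fields by position would see userId move when email
is added, even though userId itself did not change.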

This is a problem for runners that support "update" functionality. The
solution I have been working on is to allow the runner to inspect the
previous graph on an update, so that we can maintain the previous order.
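
A rough sketch of what "maintain the previous order" could look like, again
in plain Java with hypothetical names (FieldOrderReconciler and reconcile
are not Beam API). It assumes the previous field order can be read from the
old graph as a list of names, that fields no longer present are simply
dropped, and that new fields are appended at the end in sorted order so the
result does not depend on reflection order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class FieldOrderReconciler {
  // previousOrder: field names in the order used by the previous graph.
  // inferredFields: field names found by reflection now, in no particular order.
  static List<String> reconcile(List<String> previousOrder, Set<String> inferredFields) {
    List<String> result = new ArrayList<>();

    // Keep surviving fields in their previous relative order.
    for (String name : previousOrder) {
      if (inferredFields.contains(name)) {
        result.add(name);
      }
    }

    // Append fields that are new in this version, sorted by name.
    Set<String> added = new TreeSet<>(inferredFields);
    added.removeAll(previousOrder);
    result.addAll(added);

    return result;
  }

  public static void main(String[] args) {
    List<String> previous = List.of("userId", "country");
    Set<String> inferred = Set.of("email", "country", "userId", "age");
    System.out.println(reconcile(previous, inferred)); // [userId, country, age, email]
  }
}
```

With this approach existing fields keep their positions across an update,
and only the tail of the field list changes.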

If you know a way to ensure deterministic ordering, I would love to know. I
even went so far as to try opening the .class file to get members in the
order defined there, but that is very complex and error-prone, and I believe
it still doesn't guarantee order stability.

On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com> wrote:

> +1 to standardizing on a deterministic ordering for inference if none is
> imposed by the structure.
>
> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>
>> There are Beam schema providers that use Java reflection to get fields
>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>> with "creators", because function arguments are ordered. We cache instances
>> of schema coders, but there is no guarantee that it's deterministic between
>> JVMs. As a result, I've seen cases when the construction of pipeline graphs
>> and output schema is non-deterministic. It's especially relevant when
>> writing data to external storage, where row schema becomes a table schema.
>> There is a workaround to apply a transform that would make schema
>> deterministic, for instance, by ordering fields by name.
>>
>> I would see a benefit in making schemas deterministic by default or at
>> least introducing a way to do so without writing custom code. What are your
>> thoughts?
>>
>
