I have yet to figure out a way to make Schema inference deterministically ordered, because Java reflection provides no guaranteed ordering (I suspect that the JVM returns functions by iterating over a hash map, or something of that form). Ideas such as "sort all the fields" actually make things worse, because new fields will end up in the middle of the field list, which shifts the positions of the existing fields.
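To make the sorting problem concrete, here is a minimal sketch in plain Java. It does not use any Beam API; the class name, method names, and field names are all made up for illustration. It compares sorting fields by name against keeping a previously known order and appending whatever is new.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class FieldOrderDemo {
    // Hypothetical helper: order fields alphabetically.
    // Deterministic across JVMs, but positions shift when a field is added.
    static List<String> sortedOrder(List<String> fields) {
        return new ArrayList<>(new TreeSet<>(fields));
    }

    // Hypothetical helper: keep the previously known order for fields that
    // still exist, then append fields not seen before (sorted among themselves).
    static List<String> preservePreviousOrder(List<String> previous, List<String> current) {
        List<String> result = new ArrayList<>();
        for (String field : previous) {
            if (current.contains(field)) {
                result.add(field);
            }
        }
        List<String> added = new ArrayList<>(new TreeSet<>(current));
        added.removeAll(previous);
        result.addAll(added);
        return result;
    }

    public static void main(String[] args) {
        // Field names as reflection might return them, in arbitrary order.
        List<String> v1 = Arrays.asList("timestamp", "id", "name");
        List<String> v2 = Arrays.asList("name", "location", "timestamp", "id");

        System.out.println(sortedOrder(v1)); // [id, name, timestamp]
        System.out.println(sortedOrder(v2)); // [id, location, name, timestamp]

        // Appending to the previous order keeps existing indices stable.
        System.out.println(preservePreviousOrder(sortedOrder(v1), v2));
        // [id, name, timestamp, location]
    }
}
```

With sorting alone, "name" moves from index 1 to index 2 once "location" is added. Appending to a known previous order avoids that, but it only works if the previous order is available from somewhere.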
The unstable ordering is a problem for runners that support "update" functionality. The solution I have been working on is to let the runner inspect the previous graph on an update, so that we keep the previous field order. If you know a way to ensure deterministic ordering, I would love to hear it. I even went so far as to try opening the .class file to read members in the order they are defined there, but that is very complex and error-prone, and I believe it still doesn't guarantee order stability.

On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com> wrote:

> +1 to standardizing on a deterministic ordering for inference if none is
> imposed by the structure.
>
> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>
>> There are Beam schema providers that use Java reflection to get fields
>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>> with "creators", because function arguments are ordered. We cache instances
>> of schema coders, but there is no guarantee that it's deterministic between
>> JVMs. As a result, I've seen cases when the construction of pipeline graphs
>> and output schema is non-deterministic. It's especially relevant when
>> writing data to external storage, where row schema becomes a table schema.
>> There is a workaround to apply a transform that would make schema
>> deterministic, for instance, by ordering fields by name.
>>
>> I would see a benefit in making schemas deterministic by default or at
>> least introducing a way to do so without writing custom code. What are your
>> thoughts?
>>
>