Re: Deterministic field ordering in derived schemas

Luke Cwik Wed, 05 Feb 2020 10:48:20 -0800

The Java compiler doesn't know whether a field was added or removed
when compiling source to class, so there is no way for it to provide an
ordering that puts "new" fields at the end, and the source specification
doesn't allow users to state the field ordering that should be used.
You can ask users to annotate a field ordering[1] using custom annotations,
but a general solution will require some type of sorting.


1: https://stackoverflow.com/a/1099389/4368200
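
To make the annotation idea concrete, here is a minimal sketch. It assumes
a hypothetical @FieldOrder annotation and a hypothetical Purchase class;
neither is part of Beam, and this is not how Beam's schema providers work
today. Annotated fields are sorted by their declared index, and unannotated
fields fall back to sorting by name after them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FieldOrderSketch {

  /** Hypothetical annotation (not a Beam API): explicit position of a field. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  @interface FieldOrder {
    int value();
  }

  /** Hypothetical POJO used only for illustration. */
  static class Purchase {
    @FieldOrder(0) String userId;
    @FieldOrder(1) long amountCents;
    @FieldOrder(2) String currency; // a field added later takes the next index
    String note;                    // unannotated: placed after annotated fields
  }

  /** Returns field names in an order that does not depend on reflection order. */
  static List<String> orderedFieldNames(Class<?> clazz) {
    List<Field> fields = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      // Skip static and compiler-generated fields.
      if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
        fields.add(f);
      }
    }
    // Primary key: annotation index (unannotated fields sort last).
    // Secondary key: field name, so ties are still broken deterministically.
    fields.sort(
        Comparator.comparingInt(
                (Field f) -> {
                  FieldOrder order = f.getAnnotation(FieldOrder.class);
                  return order == null ? Integer.MAX_VALUE : order.value();
                })
            .thenComparing(Field::getName));
    List<String> names = new ArrayList<>();
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(orderedFieldNames(Purchase.class));
  }
}
```

With this scheme a new field lands at the end only if the user gives it the
next unused index; the name-based fallback for unannotated fields still has
the "new fields end up in the middle" problem.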

On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <[email protected]> wrote:

> I have yet to figure out a way to make Schema inference deterministically
> ordered, because Java reflection provides no guaranteed ordering (I suspect
> that the JVM returns functions by iterating over a hash map, or something
> of that form). Ideas such as "sort all the fields" actually make things
> worse, because new fields will end up in the middle of the field list.
>
> This is a problem for runners that support an "update" functionality.
> The solution I have been working on is to allow the runner to inspect
> the previous graph on an update, to ensure that we maintain the previous
> order.
>
> If you know a way to ensure deterministic ordering, I would love to know.
> I even went so far as to try opening the .class file to get members in the
> order defined there, but that is very complex, error-prone, and I believe
> still doesn't guarantee order stability.
>
> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> +1 to standardizing on a deterministic ordering for inference if none is
>> imposed by the structure.
>>
>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <[email protected]> wrote:
>>
>>> There are Beam schema providers that use Java reflection to get fields
>>> for classes with fields and auto-value classes. It isn't relevant for POJOs
>>> with "creators", because function arguments are ordered. We cache instances
>>> of schema coders, but there is no guarantee that the field order is
>>> deterministic between JVMs. As a result, I've seen cases where the
>>> construction of pipeline graphs and the output schema are
>>> non-deterministic. It's especially relevant when
>>> writing data to external storage, where row schema becomes a table schema.
>>> There is a workaround to apply a transform that would make schema
>>> deterministic, for instance, by ordering fields by name.
>>>
>>> I would see a benefit in making schemas deterministic by default or at
>>> least introducing a way to do so without writing custom code. What are your
>>> thoughts?
>>>
>>
