Let's understand the use case first. My concern was making SchemaCoder compatible across different invocations of a pipeline, which is why I introduced encoding_position. It allows the field id to change while the encoding_position is preserved. However, this is internal to a pipeline.
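For concreteness, a minimal sketch of that idea, assuming the Java Schema class exposes the proto's encoding_position via setEncodingPositions (the exact API surface may differ): pinning encoding positions by field name keeps the byte-level field order stable even when inference assigns different field indexes on each run.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.schemas.Schema;

    public class EncodingPositions {
      public static void main(String[] args) {
        // Two invocations of the same pipeline may infer the same fields in
        // a different order, so the field indexes differ between runs.
        Schema run1 = Schema.builder()
            .addStringField("userId")
            .addInt64Field("timestamp")
            .build();
        Schema run2 = Schema.builder()
            .addInt64Field("timestamp") // inferred first this time
            .addStringField("userId")
            .build();

        // Pinning encoding positions by field name keeps the coder's byte
        // order stable across runs even though the field indexes changed.
        Map<String, Integer> positions = new HashMap<>();
        positions.put("userId", 0);
        positions.put("timestamp", 1);
        run1.setEncodingPositions(positions);
        run2.setEncodingPositions(positions);
      }
    }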
If the worry is writing rows to a sink, how are the rows being written? I would highly advise against using Beam's internal binary representation to write rows external to a pipeline. That representation is meant to be an internal detail of schemas, not a public binary format. Rows should be converted to some public format before being written. I wonder if a convenience method on Row - getValuesOrderedByName() - would be sufficient for this use case? (A sketch of such a helper appears after the quoted thread below.)

Reuven

On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles <k...@apache.org> wrote:

> Are we in danger of reinventing protobuf's practice of giving fields numbers? (This practice itself was almost certainly in use decades before protobuf's creation.) Could we just use the same practice?
>
> Schema fields already have integer IDs and "encoding_position" (see https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto). Are these the same as proto field numbers? Do we need both? What is the expectation around how they interact? The proto needs comments/documentation!
>
> This does not directly address the question, but any solution related to how auto-generated schemas work should be specified in terms of the proto. For example, annotations to suggest one or both of these fields. Or, lacking that, sorting by name (giving up on "new fields come last" behavior). Or warning that the schema is unstable. Etc.
>
> Kenn
>
> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:
>
>> The Java compiler doesn't know whether a field was added or removed when compiling source to class, so there is no way for it to provide an ordering that puts "new" fields at the end, and the source specification doesn't allow users to state the field ordering that should be used. You can ask users to annotate a field ordering [1] using custom annotations, but a general solution will require some type of sorting.
>>
>> 1: https://stackoverflow.com/a/1099389/4368200
>>
>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>>
>>> I have yet to figure out a way to make Schema inference deterministically ordered, because Java reflection provides no guaranteed ordering (I suspect that the JVM returns functions by iterating over a hash map, or something of that form). Ideas such as "sort all the fields" actually make things worse, because new fields will end up in the middle of the field list.
>>>
>>> This is a problem for runners that support an "update" functionality. Currently the solution I was working on was to allow the runner to inspect the previous graph on an update, to ensure that we maintain the previous order.
>>>
>>> If you know a way to ensure deterministic ordering, I would love to know. I even went so far as to try to open the .class file to get members in the order defined there, but that is very complex, error prone, and I believe still doesn't guarantee order stability.
>>>
>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> +1 to standardizing on a deterministic ordering for inference if none is imposed by the structure.
>>>>
>>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>
>>>>> There are Beam schema providers that use Java reflection to get fields for plain classes with fields and for AutoValue classes. It isn't relevant for POJOs with "creators", because function arguments are ordered.
>>>>> We cache instances of schema coders, but there is no guarantee that the inferred field order is deterministic between JVMs. As a result, I've seen cases where the construction of pipeline graphs and output schemas is non-deterministic. It's especially relevant when writing data to external storage, where the row schema becomes a table schema. There is a workaround: apply a transform that makes the schema deterministic, for instance by ordering fields by name.
>>>>>
>>>>> I would see a benefit in making schemas deterministic by default, or at least in introducing a way to do so without writing custom code. What are your thoughts?
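To make the Row suggestion above concrete: getValuesOrderedByName() does not exist on Row today, so here is a sketch of the proposed helper written against the existing Row and Schema APIs (the method name and placement come from the proposal, not from an existing API):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    public class RowValues {
      /** Returns the row's values ordered by field name instead of by field index. */
      public static List<Object> getValuesOrderedByName(Row row) {
        return row.getSchema().getFields().stream()
            .map(Schema.Field::getName)
            .sorted()
            .map(name -> (Object) row.getValue(name))
            .collect(Collectors.toList());
      }
    }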
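Luke's annotation idea [1] could look roughly like the following. This is a hedged sketch: FieldNumber is a hypothetical annotation, not something Beam ships; it only shows how explicit, user-assigned numbers (as in proto) make the reflected field order independent of JVM iteration order, with new fields sorting last by taking the next unused number.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.lang.reflect.Field;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class AnnotatedFieldOrder {

      /** Hypothetical annotation letting users pin a stable field number. */
      @Retention(RetentionPolicy.RUNTIME)
      @Target(ElementType.FIELD)
      public @interface FieldNumber {
        int value();
      }

      /** Example POJO: a newly added field takes the next unused number. */
      public static class Event {
        @FieldNumber(0) public String userId;
        @FieldNumber(1) public long timestamp;
        @FieldNumber(2) public String newlyAddedField;
      }

      /** Orders reflected fields by their annotated number, not JVM order. */
      public static List<Field> orderedFields(Class<?> clazz) {
        return Arrays.stream(clazz.getDeclaredFields())
            .filter(f -> f.isAnnotationPresent(FieldNumber.class))
            .sorted(Comparator.comparingInt(
                (Field f) -> f.getAnnotation(FieldNumber.class).value()))
            .collect(Collectors.toList());
      }
    }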
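And Gleb's workaround of ordering fields by name could be a small utility along these lines (a sketch assuming Schema.Builder.addFields and Row.withSchema behave as in the current Java SDK; note Reuven's caveat above that name-sorting puts newly added fields in the middle of the list, so this buys determinism, not "new fields come last"):

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    public class SortSchema {
      /** Returns an equivalent schema with fields sorted by name. */
      public static Schema sortedByName(Schema schema) {
        List<Schema.Field> fields = schema.getFields().stream()
            .sorted(Comparator.comparing(Schema.Field::getName))
            .collect(Collectors.toList());
        return Schema.builder().addFields(fields).build();
      }

      /** Re-maps a row onto the name-sorted version of its schema. */
      public static Row sortedByName(Row row) {
        Schema sorted = sortedByName(row.getSchema());
        Row.Builder builder = Row.withSchema(sorted);
        for (Schema.Field field : sorted.getFields()) {
          builder.addValue(row.getValue(field.getName()));
        }
        return builder.build();
      }
    }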