On Thu, Feb 6, 2020 at 12:57 AM Gleb Kanterov <g...@spotify.com> wrote:

> Field ordering matters, for instance, for a batch pipeline writing to a
> non-partitioned BigQuery table. Each partition is a new table with its own
> schema, so each day a new table would get a non-deterministic field
> ordering. It's arguable whether it's good practice to define a table
> schema using a Java class, even if field ordering were deterministic:
> because the schema definition language is embedded in Java, it isn't as
> clear as it is in, say, Protobuf whether a change preserves schema
> compatibility. However, I can see how borrowing the concept of field
> numbers would make this clearer.
>

We support inferring schemas from Protobuf, in which case we make sure that
the schema order matches the protobuf field numbers. As you say, the
ordering problem only arises when inferring a schema from a Java POJO or
Bean class (such as AutoValue), in which case you have to be careful when
writing out to external systems.

Another possibility: we already support a @SchemaFieldName annotation
which you can put on a getter method to provide a different schema name
than the one automatically inferred from the method definition. It would
be fairly trivial to add a @SchemaFieldNumber annotation to provide a
deterministic ordering of the fields, similar to what you have in protobuf.
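A rough sketch of how such an annotation could drive ordering. The @SchemaFieldNumber annotation here is hypothetical (it is only being proposed above), and the getter-to-field-name logic is illustrative, not Beam's actual schema-inference code:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SchemaFieldNumberSketch {
  // Hypothetical annotation: pins each getter to an explicit position,
  // much like a protobuf field number.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface SchemaFieldNumber {
    int value();
  }

  public static class User {
    @SchemaFieldNumber(1) public String getName() { return "n"; }
    @SchemaFieldNumber(2) public int getAge() { return 0; }
    @SchemaFieldNumber(3) public String getEmail() { return "e"; }
  }

  // Returns field names in the order pinned by the annotation, regardless
  // of the (unspecified) order in which reflection returns the getters.
  public static List<String> orderedFieldNames(Class<?> clazz) {
    List<Method> getters = new ArrayList<>();
    for (Method m : clazz.getDeclaredMethods()) {
      if (m.isAnnotationPresent(SchemaFieldNumber.class)) {
        getters.add(m);
      }
    }
    getters.sort(Comparator.comparingInt(
        m -> m.getAnnotation(SchemaFieldNumber.class).value()));
    List<String> names = new ArrayList<>();
    for (Method m : getters) {
      String raw = m.getName().substring(3); // "getName" -> "Name"
      names.add(Character.toLowerCase(raw.charAt(0)) + raw.substring(1));
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(orderedFieldNames(User.class)); // [name, age, email]
  }
}
```

As with protobuf field numbers, the annotation values would have to be kept stable by the class author; the annotation only makes the ordering explicit rather than inferred.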


>
> A similar concern is relevant to streaming pipelines if there is no
> "update" functionality, or for a pipeline that needs to be drained and
> restarted.
>
> What are the requirements for updating streaming pipelines? Is it only
> that encoding positions for existing fields shouldn't change? With that, I
> don't understand how "sort all the fields" makes the "update" case worse.
> As I see it, it fixes writing to external storage and doesn't solve the
> problem of "update", but it doesn't make it worse either.
>

Yes. Originally we didn't have encoding positions. Once they're hooked up
to runners (I don't think any runner has integrated yet), then sorting
fields should be fine because we can ensure that the encoding positions
don't change.
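To illustrate the idea, here is a toy model of encoding positions (not Beam's actual wire format or API): each field carries a position that fixes its slot in the encoded output, so two pipelines that infer the same fields in different declaration orders still produce identical encodings:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EncodingPositionSketch {
  // Toy model only: encode a row by placing each value into the slot given
  // by its encoding position, independent of the order in which the schema
  // happens to list the fields.
  public static List<Object> encode(
      Map<String, Object> row, Map<String, Integer> encodingPositions) {
    Object[] slots = new Object[encodingPositions.size()];
    for (Map.Entry<String, Object> e : row.entrySet()) {
      slots[encodingPositions.get(e.getKey())] = e.getValue();
    }
    return List.of(slots);
  }

  public static void main(String[] args) {
    Map<String, Integer> positions = Map.of("name", 0, "age", 1);

    // Two pipelines infer the same fields in different declaration order...
    Map<String, Object> rowA = new LinkedHashMap<>();
    rowA.put("name", "ada");
    rowA.put("age", 36);
    Map<String, Object> rowB = new LinkedHashMap<>();
    rowB.put("age", 36);
    rowB.put("name", "ada");

    // ...but the encoded layout is identical, which is what keeps an
    // "update" compatible even if field order changes between submissions.
    System.out.println(encode(rowA, positions).equals(encode(rowB, positions)));
  }
}
```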

>
> Gleb
>
> On Thu, Feb 6, 2020 at 6:01 AM Reuven Lax <re...@google.com> wrote:
>
>> Let's understand the use case first.
>>
>> My concern was with making SchemaCoder compatible between different
>> invocations of a pipeline, and that's why I introduced encoding_position.
>> This allows the field id to change, but we can preserve the same
>> encoding_position. However this is internal to a pipeline.
>>
>> If the worry is writing rows to a sink, how are the rows being written? I
>> would highly advise against using Beam's internal binary representation to
>> write rows external to a pipeline. That representation is meant to be an
>> internal detail of schemas, not a public binary format. Rows should be
>> converted to some public format before being written.
>>
>> I wonder if a convenience method on Row - getValuesOrderedByName() -
>> would be sufficient for this use case?
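The suggested convenience method could look roughly like this. A plain map stands in for Beam's Row here, so this is only a sketch of the semantics; a real implementation would live on Row and read field names from the Schema:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedValuesSketch {
  // Sketch of getValuesOrderedByName(): return the row's values sorted by
  // field name, so the result is stable no matter what order schema
  // inference happened to produce.
  public static List<Object> getValuesOrderedByName(Map<String, Object> row) {
    // TreeMap iterates keys in natural (lexicographic) order.
    return new ArrayList<>(new TreeMap<>(row).values());
  }

  public static void main(String[] args) {
    Map<String, Object> row = Map.of("email", "a@b", "age", 36, "name", "ada");
    System.out.println(getValuesOrderedByName(row)); // [36, a@b, ada]
  }
}
```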
>>
>> Reuven
>>
>> On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Are we in danger of reinventing protobuf's practice of giving fields
>>> numbers? (This practice itself was almost certainly in use decades before
>>> protobuf's creation.) Could we just use the same practice?
>>>
>>> Schema fields already have integer IDs and "encoding_position" (see
>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
>>> Are these the same as proto field numbers? Do we need both? What is the
>>> expectation around how they interact? The proto needs
>>> comments/documentation!
>>>
>>> This does not directly address the question, but any solution related to
>>> how auto-generated schemas work should be specified in terms of the proto.
>>> For example, annotations to suggest one or both of these fields. Or,
>>> lacking that, sorting by name (giving up on "new fields come last"
>>> behavior). Or warning that the schema is unstable. Etc.
>>>
>>> Kenn
>>>
>>> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> The Java compiler doesn't know whether a field was added or removed
>>>> when compiling source to class, so there is no way for it to provide an
>>>> ordering that puts "new" fields at the end, and the source specification
>>>> doesn't allow users to state the field ordering that should be used. You
>>>> can ask users to annotate a field ordering[1] using custom annotations,
>>>> but a general solution will require some type of sorting.
>>>>
>>>> 1: https://stackoverflow.com/a/1099389/4368200
>>>>
>>>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> I have yet to figure out a way to make Schema inference
>>>>> deterministically ordered, because Java reflection provides no guaranteed
>>>>> ordering (I suspect that the JVM returns functions by iterating over a
>>>>> hash map, or something of that form). Ideas such as "sort all the fields"
>>>>> actually make things worse, because new fields will end up in the middle
>>>>> of the field list.
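For reference, the underlying issue is that Class.getDeclaredMethods() is documented to return methods "in no particular order", so any determinism has to be imposed after the fact. A minimal sketch of that (the Pojo class and getter filter are illustrative, not Beam's inference code):

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReflectionOrderSketch {
  public static class Pojo {
    public String getName() { return ""; }
    public int getAge() { return 0; }
    public String getEmail() { return ""; }
  }

  // getDeclaredMethods() makes no ordering guarantee, and the order can
  // differ between JVMs. The only way to get a stable result from
  // reflection alone is to sort the results yourself.
  public static List<String> getterNamesSorted(Class<?> clazz) {
    List<String> names = new ArrayList<>();
    for (Method m : clazz.getDeclaredMethods()) {
      if (m.getName().startsWith("get") && m.getParameterCount() == 0) {
        names.add(m.getName());
      }
    }
    Collections.sort(names); // imposed ordering; reflection provides none
    return names;
  }

  public static void main(String[] args) {
    System.out.println(getterNamesSorted(Pojo.class)); // [getAge, getEmail, getName]
  }
}
```

This also illustrates Reuven's point: sorting is stable across runs, but a newly added getter lands wherever its name sorts, not at the end.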
>>>>>
>>>>> This is a problem for runners that support an "update" functionality.
>>>>> Currently the solution I was working on was to allow the runner to inspect
>>>>> the previous graph on an update, to ensure that we maintain the previous
>>>>> order.
>>>>>
>>>>> If you know a way to ensure deterministic ordering, I would love to
>>>>> know. I even went so far as to try to open the .class file to get members
>>>>> in the order defined there, but that is very complex, error-prone, and I
>>>>> believe it still doesn't guarantee order stability.
>>>>>
>>>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> +1 to standardizing on a deterministic ordering for inference if none
>>>>>> is imposed by the structure.
>>>>>>
>>>>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>
>>>>>>> There are Beam schema providers that use Java reflection to get
>>>>>>> fields for classes with fields and for auto-value classes. It isn't
>>>>>>> relevant for POJOs with "creators", because function arguments are
>>>>>>> ordered. We cache instances of schema coders, but there is no guarantee
>>>>>>> that they are deterministic between JVMs. As a result, I've seen cases
>>>>>>> where the construction of the pipeline graph and the output schema is
>>>>>>> non-deterministic. It's especially relevant when writing data to
>>>>>>> external storage, where the row schema becomes a table schema. There is
>>>>>>> a workaround: apply a transform that makes the schema deterministic,
>>>>>>> for instance by ordering fields by name.
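The workaround described above can be sketched like this: compute the name-sorted order once, together with a mapping from inferred positions to sorted positions, and apply that mapping to every row. Plain lists stand in for Beam's Schema and Row here:

```java
import java.util.ArrayList;
import java.util.List;

public class SortSchemaSketch {
  // Given field names in whatever order inference produced, build a mapping
  // from sorted positions back to the original positions.
  public static int[] sortedIndexMapping(List<String> inferredNames) {
    List<String> sorted = new ArrayList<>(inferredNames);
    sorted.sort(null); // natural (lexicographic) order
    int[] mapping = new int[sorted.size()];
    for (int i = 0; i < sorted.size(); i++) {
      mapping[i] = inferredNames.indexOf(sorted.get(i));
    }
    return mapping;
  }

  // Reorder a row's values to follow the deterministic, name-sorted schema.
  public static List<Object> reorder(List<Object> values, int[] mapping) {
    List<Object> out = new ArrayList<>(values.size());
    for (int src : mapping) {
      out.add(values.get(src));
    }
    return out;
  }

  public static void main(String[] args) {
    // Suppose inference produced [name, email, age] on this JVM.
    List<String> inferred = List.of("name", "email", "age");
    int[] mapping = sortedIndexMapping(inferred);
    // Values follow the sorted schema [age, email, name]:
    System.out.println(reorder(List.of("ada", "a@b", 36), mapping));
  }
}
```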
>>>>>>>
>>>>>>> I would see a benefit in making schemas deterministic by default or
>>>>>>> at least introducing a way to do so without writing custom code. What
>>>>>>> are your thoughts?
>>>>>>>
>>>>>>
