Out of curiosity, in what cases would Schema.fields[index] not represent the encoding_position?
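For concreteness, here is a hypothetical sketch (plain Java, not Beam's actual Schema API; the class and field names are invented) of one way the two could diverge: if a field list is re-sorted, say by name, while each field keeps its original encoding position for wire compatibility, then the list index no longer matches encoding_position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: list index vs. encoding_position diverging after a
// re-sort. Not Beam's real Schema/Field types.
public class EncodingPositionDemo {
    record Field(String name, int encodingPosition) {}

    public static void main(String[] args) {
        // v1 of the schema declared: userId, timestamp. v2 adds "amount",
        // which gets the next free encoding position so old encodings stay
        // readable, but the field list is then sorted by name.
        List<Field> fields = new ArrayList<>(List.of(
            new Field("userId", 0),
            new Field("timestamp", 1),
            new Field("amount", 2)));
        fields.sort(Comparator.comparing(Field::name));

        for (int i = 0; i < fields.size(); i++) {
            System.out.println(i + " -> " + fields.get(i).name()
                + " (encoding_position " + fields.get(i).encodingPosition() + ")");
        }
        // "amount" now sits at list index 0 but still encodes at position 2,
        // so fields[index] no longer tells you the encoding position.
    }
}
```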
On Thu, Feb 6, 2020 at 12:57 AM Gleb Kanterov <g...@spotify.com> wrote:

> Field ordering matters, for instance, for a batch pipeline writing to a
> non-partitioned BigQuery table. Each partition is a new table with its own
> schema, so each day a new table would get a non-deterministic field
> ordering. It's arguable whether it's good practice to define a table
> schema using a Java class, even if field ordering were deterministic.
> Because the schema definition language is embedded in Java, it isn't as
> clear as it is for, say, Protobuf whether a change preserves schema
> compatibility. However, I can see how borrowing the concept of field
> numbers would make it clearer.
>
> A similar concern applies to streaming pipelines if there is no "update"
> functionality, or if the pipeline needs to be drained and restarted.
>
> What are the requirements for updating streaming pipelines? Is it only
> that encoding positions for existing fields shouldn't change? If so, I
> don't understand how "sort all the fields" makes the "update" case worse.
> As I see it, it fixes writing to external storage; it doesn't solve the
> "update" problem, but it doesn't make it worse either.
>
> Gleb
>
> On Thu, Feb 6, 2020 at 6:01 AM Reuven Lax <re...@google.com> wrote:
>
>> Let's understand the use case first.
>>
>> My concern was with making SchemaCoder compatible between different
>> invocations of a pipeline, and that's why I introduced encoding_position.
>> It allows the field id to change while we preserve the same
>> encoding_position. However, this is internal to a pipeline.
>>
>> If the worry is writing rows to a sink, how are the rows being written?
>> I would highly advise against using Beam's internal binary representation
>> to write rows outside a pipeline. That representation is meant to be an
>> internal detail of schemas, not a public binary format. Rows should be
>> converted to some public format before being written.
>> I wonder if a convenience method on Row - getValuesOrderedByName() -
>> would be sufficient for this use case?
>>
>> Reuven
>>
>> On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Are we in danger of reinventing protobuf's practice of giving fields
>>> numbers? (The practice itself was almost certainly in use decades before
>>> protobuf's creation.) Could we just adopt the same practice?
>>>
>>> Schema fields already have integer IDs and an "encoding_position" (see
>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
>>> Are these the same as proto field numbers? Do we need both? What is the
>>> expectation around how they interact? The proto needs
>>> comments/documentation!
>>>
>>> This does not directly address the question, but any solution related
>>> to how auto-generated schemas work should be specified in terms of the
>>> proto. For example, annotations to suggest one or both of these fields.
>>> Or, lacking that, sorting by name (giving up on "new fields come last"
>>> behavior). Or warning that the schema is unstable. Etc.
>>>
>>> Kenn
>>>
>>> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> The Java compiler doesn't know whether a field was added or removed
>>>> when compiling source to class, so there is no way for it to provide an
>>>> ordering that puts "new" fields at the end, and the source specification
>>>> doesn't allow users to state the field ordering that should be used. You
>>>> can ask users to annotate a field ordering[1] using custom annotations,
>>>> but a general solution will require some type of sorting.
>>>>
>>>> 1: https://stackoverflow.com/a/1099389/4368200
>>>>
>>>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> I have yet to figure out a way to make schema inference
>>>>> deterministically ordered, because Java reflection provides no
>>>>> guaranteed ordering (I suspect the JVM returns members by iterating
>>>>> over a hash map, or something of that form). Ideas such as "sort all
>>>>> the fields" actually make things worse, because new fields will end up
>>>>> in the middle of the field list.
>>>>>
>>>>> This is a problem for runners that support an "update" functionality.
>>>>> The solution I was working on was to allow the runner to inspect the
>>>>> previous graph on an update, to ensure that we maintain the previous
>>>>> order.
>>>>>
>>>>> If you know a way to ensure deterministic ordering, I would love to
>>>>> hear it. I even went so far as to try to open the .class file to get
>>>>> members in the order defined there, but that is very complex and error
>>>>> prone, and I believe it still doesn't guarantee order stability.
>>>>>
>>>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> +1 to standardizing on a deterministic ordering for inference if
>>>>>> none is imposed by the structure.
>>>>>>
>>>>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>
>>>>>>> There are Beam schema providers that use Java reflection to get
>>>>>>> fields for classes with fields and for auto-value classes. This
>>>>>>> isn't relevant for POJOs with "creators", because function arguments
>>>>>>> are ordered. We cache instances of schema coders, but there is no
>>>>>>> guarantee that the result is deterministic between JVMs. As a
>>>>>>> result, I've seen cases where the construction of pipeline graphs
>>>>>>> and output schemas is non-deterministic.
>>>>>>> It's especially relevant when writing data to external storage,
>>>>>>> where the row schema becomes a table schema. There is a workaround:
>>>>>>> apply a transform that makes the schema deterministic, for instance
>>>>>>> by ordering fields by name.
>>>>>>>
>>>>>>> I would see a benefit in making schemas deterministic by default,
>>>>>>> or at least in introducing a way to do so without writing custom
>>>>>>> code. What are your thoughts?
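To illustrate the annotation idea raised in the thread, here is a hypothetical sketch (the @FieldNumber annotation and the Event class are invented for illustration, not Beam APIs) of protobuf-style field numbers giving reflection-based inference a deterministic order, independent of whatever order getDeclaredFields() happens to return:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: borrow protobuf-style field numbers so that
// reflection-based schema inference orders fields deterministically.
public class AnnotatedFieldOrder {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface FieldNumber {
        int value();
    }

    static class Event {
        @FieldNumber(2) public String payload;
        @FieldNumber(1) public long timestamp;
        @FieldNumber(3) public String source; // new fields get the next number
    }

    // getDeclaredFields() makes no ordering guarantee, so sort by the
    // user-declared field number instead.
    static Field[] orderedFields(Class<?> clazz) {
        Field[] fields = clazz.getDeclaredFields();
        Arrays.sort(fields, Comparator.comparingInt(
            (Field f) -> f.getAnnotation(FieldNumber.class).value()));
        return fields;
    }

    public static void main(String[] args) {
        for (Field f : orderedFields(Event.class)) {
            System.out.println(f.getName());
        }
        // timestamp, payload, source: stable across JVMs, new fields last.
    }
}
```

Unlike sorting by name, numbering keeps "new fields come last" behavior, at the cost of asking users to maintain the numbers, the same trade-off protobuf makes.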