Out of curiosity, in what cases would Schema.fields[index] not represent the encoding_position?
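For concreteness, here is a hypothetical sketch (plain Java, not Beam's actual Schema API; the class and field names are invented) of one way the two could diverge: if a field list is re-sorted, say by name, while each field keeps its original encoding position for wire compatibility, then the list index no longer matches encoding_position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: list index vs. encoding_position diverging after a
// re-sort. Not Beam's real Schema/Field types.
public class EncodingPositionDemo {
    record Field(String name, int encodingPosition) {}

    public static void main(String[] args) {
        // v1 of the schema declared: userId, timestamp. v2 adds "amount",
        // which gets the next free encoding position so old encodings stay
        // readable, but the field list is then sorted by name.
        List<Field> fields = new ArrayList<>(List.of(
            new Field("userId", 0),
            new Field("timestamp", 1),
            new Field("amount", 2)));
        fields.sort(Comparator.comparing(Field::name));

        for (int i = 0; i < fields.size(); i++) {
            System.out.println(i + " -> " + fields.get(i).name()
                + " (encoding_position " + fields.get(i).encodingPosition() + ")");
        }
        // "amount" now sits at list index 0 but still encodes at position 2,
        // so fields[index] no longer tells you the encoding position.
    }
}
```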
On Thu, Feb 6, 2020 at 12:57 AM Gleb Kanterov <g...@spotify.com> wrote:

> Field ordering matters, for instance, for a batch pipeline writing to a
> non-partitioned BigQuery table. Each partition is a new table with its own
> schema, so each day a new table would get a non-deterministic field
> ordering. It's arguable whether it's good practice to define a table
> schema using a Java class, even if field ordering were deterministic.
> Because the schema definition language is embedded in Java, it isn't as
> clear as it is for, say, Protobuf whether a change preserves schema
> compatibility. However, I can see how borrowing the concept of field
> numbers would make it clearer.
>
> A similar concern applies to streaming pipelines if there is no "update"
> functionality, or if the pipeline needs to be drained and restarted.
>
> What are the requirements for updating streaming pipelines? Is it only
> that encoding positions for existing fields shouldn't change? If so, I
> don't understand how "sort all the fields" makes the "update" case worse.
> As I see it, it fixes writing to external storage; it doesn't solve the
> "update" problem, but it doesn't make it worse either.
>
> Gleb
>
> On Thu, Feb 6, 2020 at 6:01 AM Reuven Lax <re...@google.com> wrote:
>
>> Let's understand the use case first.
>>
>> My concern was with making SchemaCoder compatible between different
>> invocations of a pipeline, and that's why I introduced encoding_position.
>> It allows the field id to change while we preserve the same
>> encoding_position. However, this is internal to a pipeline.
>>
>> If the worry is writing rows to a sink, how are the rows being written?
>> I would highly advise against using Beam's internal binary representation
>> to write rows outside a pipeline. That representation is meant to be an
>> internal detail of schemas, not a public binary format. Rows should be
>> converted to some public format before being written.
>> I wonder if a convenience method on Row - getValuesOrderedByName() -
>> would be sufficient for this use case?
>>
>> Reuven
>>
>> On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Are we in danger of reinventing protobuf's practice of giving fields
>>> numbers? (The practice itself was almost certainly in use decades before
>>> protobuf's creation.) Could we just adopt the same practice?
>>>
>>> Schema fields already have integer IDs and an "encoding_position" (see
>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
>>> Are these the same as proto field numbers? Do we need both? What is the
>>> expectation around how they interact? The proto needs
>>> comments/documentation!
>>>
>>> This does not directly address the question, but any solution related
>>> to how auto-generated schemas work should be specified in terms of the
>>> proto. For example, annotations to suggest one or both of these fields.
>>> Or, lacking that, sorting by name (giving up on "new fields come last"
>>> behavior). Or warning that the schema is unstable. Etc.
>>>
>>> Kenn
>>>
>>> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> The Java compiler doesn't know whether a field was added or removed
>>>> when compiling source to class, so there is no way for it to provide an
>>>> ordering that puts "new" fields at the end, and the source specification
>>>> doesn't allow users to state the field ordering that should be used. You
>>>> can ask users to annotate a field ordering[1] using custom annotations,
>>>> but a general solution will require some type of sorting.
>>>>
>>>> 1: https://stackoverflow.com/a/1099389/4368200
>>>>
>>>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> I have yet to figure out a way to make schema inference
>>>>> deterministically ordered, because Java reflection provides no
>>>>> guaranteed ordering (I suspect the JVM returns members by iterating
>>>>> over a hash map, or something of that form). Ideas such as "sort all
>>>>> the fields" actually make things worse, because new fields will end up
>>>>> in the middle of the field list.
>>>>>
>>>>> This is a problem for runners that support an "update" functionality.
>>>>> The solution I was working on was to allow the runner to inspect the
>>>>> previous graph on an update, to ensure that we maintain the previous
>>>>> order.
>>>>>
>>>>> If you know a way to ensure deterministic ordering, I would love to
>>>>> hear it. I even went so far as to try to open the .class file to get
>>>>> members in the order defined there, but that is very complex and error
>>>>> prone, and I believe it still doesn't guarantee order stability.
>>>>>
>>>>> On Wed, Feb 5, 2020 at 9:15 AM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> +1 to standardizing on a deterministic ordering for inference if
>>>>>> none is imposed by the structure.
>>>>>>
>>>>>> On Wed, Feb 5, 2020, 8:55 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>
>>>>>>> There are Beam schema providers that use Java reflection to get
>>>>>>> fields for classes with fields and for auto-value classes. This
>>>>>>> isn't relevant for POJOs with "creators", because function arguments
>>>>>>> are ordered. We cache instances of schema coders, but there is no
>>>>>>> guarantee that the result is deterministic between JVMs. As a
>>>>>>> result, I've seen cases where the construction of pipeline graphs
>>>>>>> and output schemas is non-deterministic.
>>>>>>> It's especially relevant when writing data to external storage,
>>>>>>> where the row schema becomes a table schema. There is a workaround:
>>>>>>> apply a transform that makes the schema deterministic, for instance
>>>>>>> by ordering fields by name.
>>>>>>>
>>>>>>> I would see a benefit in making schemas deterministic by default,
>>>>>>> or at least in introducing a way to do so without writing custom
>>>>>>> code. What are your thoughts?
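To illustrate the annotation idea raised in the thread, here is a hypothetical sketch (the @FieldNumber annotation and the Event class are invented for illustration, not Beam APIs) of protobuf-style field numbers giving reflection-based inference a deterministic order, independent of whatever order getDeclaredFields() happens to return:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: borrow protobuf-style field numbers so that
// reflection-based schema inference orders fields deterministically.
public class AnnotatedFieldOrder {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface FieldNumber {
        int value();
    }

    static class Event {
        @FieldNumber(2) public String payload;
        @FieldNumber(1) public long timestamp;
        @FieldNumber(3) public String source; // new fields get the next number
    }

    // getDeclaredFields() makes no ordering guarantee, so sort by the
    // user-declared field number instead.
    static Field[] orderedFields(Class<?> clazz) {
        Field[] fields = clazz.getDeclaredFields();
        Arrays.sort(fields, Comparator.comparingInt(
            (Field f) -> f.getAnnotation(FieldNumber.class).value()));
        return fields;
    }

    public static void main(String[] args) {
        for (Field f : orderedFields(Event.class)) {
            System.out.println(f.getName());
        }
        // timestamp, payload, source: stable across JVMs, new fields last.
    }
}
```

Unlike sorting by name, numbering keeps "new fields come last" behavior, at the cost of asking users to maintain the numbers, the same trade-off protobuf makes.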