Re: [Question] Beam Schema, fields order

Alexey Romanenko Tue, 05 Apr 2022 13:05:13 -0700

Thanks for the answers, Reuven. Please see my additional questions inline.

> On 5 Apr 2022, at 20:07, Reuven Lax <[email protected]> wrote:
> 
> On Tue, Apr 5, 2022 at 9:55 AM Alexey Romanenko <[email protected]> wrote:
> 
> So, a different field order matters.
> 
> Additionally, since "Schema.equals()" is used in "Row.equals()", it means
> that two Rows with differently ordered schemas but the same values will be
> considered different rows. Is that correct?
> 
> Yes, but there are ways of dealing with this:


But what is the point of this? Why would the field order be important, and
under which circumstances?

> 1. If using Dataflow, the pipeline update feature allows you to update to a 
> compatible schema (i.e. one in which the fields have the same names but a 
> different order)
> 2. You can use the Convert transform to convert rows to a compatible schema 
> with a different order.

Well, for now it’s mostly related to unit tests (e.g. 
AvroSchemaTest.testPojoRecordToRow()) where we compare a manually created row 
with another row that is created from a POJO with AvroRecordSchema. I’m 
experimenting with an Avro version upgrade [1], and the test fails because of 
some changes in Avro: it now creates an Avro schema with a different field 
order. So I’m wondering what we can do about that.

[1] https://github.com/apache/beam/pull/17246
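
To make the test problem concrete, here is a rough, Beam-free sketch of the
kind of comparison I have in mind. SimpleRow and all of its methods are
hypothetical stand-ins that I made up for illustration, they are not Beam API.
The strict comparison models the order-sensitive behaviour discussed above, and
the second one compares values by field name only. In Beam itself, I believe
Schema.equivalent() does an order-insensitive comparison of schemas, but I have
not checked whether it fits this test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Beam-free model of the problem; every type here is hypothetical. */
final class OrderInsensitiveRows {

  /** Hypothetical stand-in for a Row plus its Schema: parallel names and values. */
  static final class SimpleRow {
    final List<String> fieldNames;
    final List<Object> values;

    SimpleRow(List<String> fieldNames, List<Object> values) {
      if (fieldNames.size() != values.size()) {
        throw new IllegalArgumentException("names and values must have the same size");
      }
      this.fieldNames = new ArrayList<>(fieldNames);
      this.values = new ArrayList<>(values);
    }

    /** Strict comparison: the order of fields matters. */
    boolean strictEquals(SimpleRow other) {
      return fieldNames.equals(other.fieldNames) && values.equals(other.values);
    }

    /** Name -> value view sorted by field name, so declaration order is irrelevant. */
    Map<String, Object> byName() {
      Map<String, Object> sorted = new TreeMap<>();
      for (int i = 0; i < fieldNames.size(); i++) {
        sorted.put(fieldNames.get(i), values.get(i));
      }
      return sorted;
    }

    /** Order-insensitive comparison: same field names mapped to the same values. */
    boolean equalsIgnoringFieldOrder(SimpleRow other) {
      return fieldNames.size() == other.fieldNames.size()
          && byName().equals(other.byName());
    }
  }

  public static void main(String[] args) {
    SimpleRow manual =
        new SimpleRow(Arrays.asList("name", "age"), Arrays.<Object>asList("Bob", 42));
    SimpleRow generated =
        new SimpleRow(Arrays.asList("age", "name"), Arrays.<Object>asList(42, "Bob"));

    System.out.println(manual.strictEquals(generated));             // false
    System.out.println(manual.equalsIgnoringFieldOrder(generated)); // true
  }
}
```

With something like the second comparison, the test would not depend on the
order in which a particular Avro version emits the fields.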

> 
> At the same time, when generating a schema with different schema providers, 
> the order of fields can be non-deterministic in some cases.
> 
> For example, "GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)" says 
> [3] that:
> - "schemaFor is non deterministic - it might return fields in an arbitrary 
> order. The reason why is that Java reflection does not guarantee the order in 
> which it returns fields and methods, and these schemas are often based on 
> reflective analysis of classes."
> 
> So, IIUC, this means that we can potentially end up with the "same" schema 
> but with a different field order for the same class (a POJO, for example) 
> when it is generated on different JVMs.
> 
> Correct, and see above.
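
As an illustration of the reflection issue quoted above, a minimal sketch in
plain Java: Class.getDeclaredFields() is documented as returning fields in no
particular order, so anything derived from it needs an explicit ordering to be
reproducible. Sorting by name is just one possible choice and is my assumption
here, not what Beam's schema providers do. The User class and the
sortedFieldNames() helper are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DeterministicFieldOrder {

  /** Hypothetical POJO, used only for illustration. */
  static final class User {
    String name;
    int age;
    String email;
  }

  /** Returns the instance field names of a class, sorted alphabetically. */
  static List<String> sortedFieldNames(Class<?> clazz) {
    List<String> names = new ArrayList<>();
    // getDeclaredFields() makes no promise about the order of its result.
    for (Field field : clazz.getDeclaredFields()) {
      // Skip static and compiler-generated fields; only instance data matters here.
      if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
        continue;
      }
      names.add(field.getName());
    }
    // An explicit sort makes the result independent of the JVM's reflection order.
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    System.out.println(sortedFieldNames(User.class)); // [age, email, name]
  }
}
```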
>  
> 
> So, my actual questions: 
> - Should two Rows with the same field values, but with schemas that have a 
> different field order, be considered two different rows or not?
> - Is the behaviour explained above what was expected by the initial schema 
> design? 
> - If the field order is so important, then why?
> 
> PS: My question is actually related to "AvroRecordSchema().toRowFunction()", 
> but I guess other SchemaProviders can be affected as well.
> 
> 
> —
> Alexey
> 
> [1] https://beam.apache.org/documentation/programming-guide/#schema-definition
> [2] https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
> [3] https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91
