[I] [Bug]: Schema inference is non-deterministic [beam]

via GitHub Fri, 09 Feb 2024 01:44:33 -0800


iht opened a new issue, #30276:
URL: https://github.com/apache/beam/issues/30276


   ### What happened?
   
   When using the inference of schema with a POJO, a Java Bean or an AutoValue 
class, the order of the fields in the schema may change during the inference 
process. This can cause issues with updates in the runners that support them 
(e.g. Cloud Dataflow).
   
   Because the schemas are different (same fields, but in different order), the 
coders used for those `PCollection`s will be different too, and reading the 
data from the previous job's checkpoint will fail. The job will process new 
elements fine, but the old checkpoint elements will throw decode errors.
   
   Example of the same AutoValue class with two different runs, omitting the 
fields part for brevity. In the first run, this is the inferred schema:
   
   ```
   [...]
   Encoding positions:
   {sequenceNumber=12, destination=11, messageId=7, receiveTimestamp=5, 
priority=1, senderTimestamp=6, redelivered=4, timeToLive=3, payload=2, 
replyTo=8, expiration=10, replicationGroupMessageId=9, properties=0}
   [...]
   ```
   
   In the second run, no changes in the code, but the code is recompiled again, 
this is the schema:
   
   ```
   [...]
   Encoding positions:
   {sequenceNumber=12, destination=11, messageId=8, receiveTimestamp=6, 
priority=1, senderTimestamp=7, redelivered=5, timeToLive=3, payload=2, 
replyTo=9, expiration=10, replicationGroupMessageId=4, properties=0}
   [...]
   ```
   
   Notice how for instance `messageId` is detected as the 7th field in the 
first schema, and the 8th in the second.
   
   The problem seems to be [in the usage of `Class.getDeclaredMethods` in the 
reflect 
utils](https://github.com/apache/beam/blob/e9202abb7ad8b2041a257fb3f50337a2f03127b1/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ReflectUtils.java#L78).
   
   The documentation of `getDeclaredMethods` ([doc for Java 
8](https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getDeclaredMethods--),
 [doc for Java 
17](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Class.html#getDeclaredMethods()))
 says that: "The elements in the returned array are not sorted and are not in 
any particular order.".
   
   So in summary, the schema inference is non-deterministic.
   
   As a workaround, you can use the annotation `@SchemaFieldNumber` to override 
the order of the fields, but it is far from obvious that if you create a data 
class without changing the order of fields in the code, the order may not 
always be the same. 
   
   So I would suggest for instance ordering the getter methods 
lexicographically (which is deterministic across runs using the same code), and 
clearly adding that to the documentation. The main drawback would be that 
adding a field, even if it is nullable, would change the order if fields are 
sorted lexicographically. But if this clearly documented, users can always add 
the `@SchemaFieldNumber` to the new fields, to ensure they are appended at the 
end of the fields positions.
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [Bug]: Schema inference is non-deterministic [beam]

Reply via email to