atheendre130505 opened a new issue, #37575:
URL: https://github.com/apache/beam/issues/37575
### What happened?
Description: Beam YAML's JSON schema compatibility validation for objects is
effectively disabled due to a logic error in json_utils.py.
The function _validate_compatible (used by row_validator) is meant to check
whether a Beam schema is compatible with a provided JSON schema. However, it
contains several simple coding bugs:
- It compares the weak_schema dictionary directly to the string 'object'
  instead of checking its 'type' field.
- It iterates over a dictionary and unpacks each element as a pair without
  calling .items().
- It uses improper string formatting for error messages, leading to unhelpful
  or crashing error reports.
As a result, Validate transforms in Beam YAML may silently proceed even when
schemas are fundamentally incompatible, or fail with distracting internal
tracebacks.
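
For illustration, the three bugs listed above can be shown with a minimal,
self-contained sketch. This is not the code in json_utils.py: both schemas are
plain dicts here, and the names check_compatible, beam_like_schema and
user_schema are hypothetical stand-ins chosen for this example.

```python
# Hypothetical stand-in for the compatibility check described above.
# Both schemas are plain JSON-schema-style dicts, not Beam schema protos.


def check_compatible(weak_schema, strong_schema, path="$"):
    """Raise ValueError when the two schemas disagree on a type."""
    weak_type = weak_schema.get("type")
    strong_type = strong_schema.get("type")
    if weak_type != strong_type:
        # An f-string fills in the values; a template that is never
        # formatted would leave placeholders in the message.
        raise ValueError(
            f"Incompatible types at {path}: {weak_type!r} vs {strong_type!r}")
    # Compare the 'type' field with 'object'. Comparing the whole dict
    # with the string 'object' is always False, so the branch never runs.
    if weak_type == "object":
        strong_props = strong_schema.get("properties", {})
        # .items() yields (name, sub-schema) pairs. Iterating the dict
        # itself yields only the keys, so pair unpacking would misbehave.
        for name, weak_sub in weak_schema.get("properties", {}).items():
            if name in strong_props:
                check_compatible(
                    weak_sub, strong_props[name], f"{path}.{name}")


beam_like_schema = {"type": "object", "properties": {"f": {"type": "string"}}}
user_schema = {"type": "object", "properties": {"f": {"type": "integer"}}}

# The dict-versus-string comparison named in the first bullet.
dict_equals_string = beam_like_schema == "object"

try:
    check_compatible(beam_like_schema, user_schema)
    message = None
except ValueError as exn:
    message = str(exn)
```

With the type check and .items() in place, the mismatch on field 'f' is
reported instead of being skipped.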
Beam Version: 2.61.x (Python SDK)
Steps to reproduce
Run the following Python snippet. It should raise a ValueError about
incompatible types ('string' vs 'integer'), but it currently completes
successfully because the validation logic is skipped.
from apache_beam.yaml import json_utils
from apache_beam.portability.api import schema_pb2
from apache_beam.typehints import schemas

# A schema with a string field
beam_schema = schema_pb2.Schema(fields=[
    schemas.schema_field('f', schema_pb2.STRING)
])

# An incompatible JSON schema expecting an integer for the same field
json_schema = {
    'type': 'object',
    'properties': {
        'f': {'type': 'integer'}
    }
}

# This SHOULD fail, but silently succeeds due to logic error in json_utils.py
json_utils.row_validator(beam_schema, json_schema)
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [x] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [x] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner