MS-Kinguin opened a new issue, #30472:
URL: https://github.com/apache/beam/issues/30472
### What happened?
### Cause of the Problem
I'm experiencing a failure when reading a MongoDB collection on Atlas. The
issue arises specifically when the pipeline attempts to process documents where
the `_id` field contains mixed data types, such as strings and ObjectIds.
**Error Encountered:**
The pipeline fails during the phase where data is being read and chunked
from MongoDB, with the following error message:
`TypeError: '>=' not supported between instances of 'str' and 'ObjectId'`
This error suggests that the pipeline compares `_id` values of differing data
types while range-partitioning the collection. Python defines no ordering
between `str` and `ObjectId`, so the comparison raises and the read fails.
**Detailed Traceback:**
The failure seems to occur in `apache_beam/io/mongodbio.py`, specifically
within the `_get_auto_buckets` and `_range_is_not_splittable` methods, which
indicates that it happens during auto bucket generation for data chunking:
`File "/usr/local/lib/python3.8/site-packages/apache_beam/io/mongodbio.py",
line 443, in _range_is_not_splittable
(isinstance(start_pos, str) and start_pos >= end_pos))`
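
For illustration, the comparison in that traceback line can be reproduced without MongoDB or Beam. This is a minimal sketch, not the connector's code: `FakeObjectId` is a hypothetical stand-in that, like an ObjectId, only orders against its own type, and `range_check` and `describe_failure` are names made up for this example.

```python
class FakeObjectId:
    """Hypothetical stand-in for an ObjectId: orders only against its own type."""

    def __init__(self, hex_value: str) -> None:
        self.hex_value = hex_value

    def __ge__(self, other):
        if isinstance(other, FakeObjectId):
            return self.hex_value >= other.hex_value
        # Tell Python this type cannot be ordered against anything else.
        return NotImplemented


def range_check(start_pos, end_pos) -> bool:
    # Same shape as the expression quoted in the traceback.
    return isinstance(start_pos, str) and start_pos >= end_pos


def describe_failure(start_pos, end_pos) -> str:
    """Run the check and report either its result or the TypeError text."""
    try:
        return f"ok: {range_check(start_pos, end_pos)}"
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    # Two strings compare fine.
    print(describe_failure("b", "a"))
    # A string start with an ObjectId-like end raises, as in the report.
    print(describe_failure("abc", FakeObjectId("65e0c0ffee")))
```

The check only guards on the type of `start_pos`, so a string start paired with a non-string end reaches the `>=` and fails.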
**Suggested Solution:**
While the immediate workaround could be to ensure uniform data types for _id
fields in MongoDB collections, this is not always feasible due to existing data
and third-party data sources. Therefore, it would be beneficial if the MongoDB
connector in Apache Beam could handle mixed data types in _id fields more
gracefully, perhaps by treating all _id values as strings for the purpose of
chunking or by providing a clear way to customize the chunking strategy.
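
One possible shape for such handling is sketched below. It is an assumption about how a fix could look, not Beam's implementation: `chunk_key` and `range_is_not_splittable` are hypothetical helpers, and `FakeObjectId` again stands in for a real ObjectId. The key places strings before other types, which is meant to mirror MongoDB's cross-type ordering where strings sort before ObjectIds. A real fix would need to cover every BSON type that can appear in `_id`.

```python
class FakeObjectId:
    """Hypothetical stand-in for an ObjectId, used only for this sketch."""

    def __init__(self, hex_value: str) -> None:
        self.hex_value = hex_value

    def __str__(self) -> str:
        return self.hex_value


def chunk_key(value):
    """Return a tuple that is comparable across mixed _id types.

    Assumption: only strings and ObjectId-like values occur. Strings get
    rank 0 so they order before everything else.
    """
    rank = 0 if isinstance(value, str) else 1
    return (rank, str(value))


def range_is_not_splittable(start_pos, end_pos) -> bool:
    # Compare the derived keys instead of the raw values, so mixed types
    # never reach Python's native '>=' between unrelated classes.
    return chunk_key(start_pos) >= chunk_key(end_pos)


if __name__ == "__main__":
    oid = FakeObjectId("65e0c0ffee")
    print(range_is_not_splittable("zzz", oid))  # string start, ObjectId end
    print(range_is_not_splittable(oid, "zzz"))  # ObjectId start, string end
```

Comparing derived keys keeps the splitting logic in one place and leaves the stored `_id` values untouched.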
**Steps to Reproduce:**
1. Set up a MongoDB collection with mixed data types in the `_id` field (e.g.,
some documents with a string `_id` and others with an ObjectId `_id`).
2. Create an Apache Beam pipeline that reads from this MongoDB collection.
3. Run the pipeline and observe the failure.
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner