dingyufei615 opened a new pull request, #351:
URL: https://github.com/apache/doris-spark-connector/pull/351
# [Fix] Fix Arrow field count mismatch with schema causing read failures
# Proposed changes
Issue Number: close #349
## Problem Summary:
Fixed the `DorisException: Load Doris data failed, schema size of fetch data
is wrong` error that occurs when reading data from Doris 2.0+ with the Spark
Doris Connector. The failure is caused by Arrow returning more fields than the
schema defines.
### Root Cause
Doris 2.0+ includes internal system columns (such as
`__DORIS_DELETE_SIGN__`) in the Arrow data stream for certain table types
(e.g., Unique Key tables). These columns support the Merge-on-Read
implementation and should not be visible to users. The original strict
validation `fieldVectors.size() > schema.size()` threw an exception
immediately, preventing normal data reading.
### Solution
1. **Modified validation logic**: Changed the strict `>` check so an exception
is thrown only when `fieldVectors.size() < schema.size()` (a genuine error
scenario)
2. **Compatibility with internal columns**: Log a warning instead of throwing
an exception when `fieldVectors.size() > schema.size()`
3. **Process only user columns**: In both `readBatch()` and
`convertArrowToRowBatch()`, process only the columns defined in the schema and
ignore the extra internal ones (see the sketch after this list)
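
To make the change concrete, here is a minimal, self-contained sketch of the
relaxed check. Only Arrow's `VectorSchemaRoot`/`FieldVector` API is real; the
helper class name, the `IllegalStateException`, and the `System.err` warning
are hypothetical stand-ins for the connector's actual `RowBatch` fields,
`DorisException`, and logger:

```java
import java.util.List;

import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Standalone illustration of the relaxed validation; the real change lives
// inside RowBatch.readBatch(), whose fields and logger are not reproduced here.
final class ArrowFieldCountCheck {

    // Returns only the user-visible field vectors, tolerating extra trailing
    // internal columns such as __DORIS_DELETE_SIGN__.
    static List<FieldVector> userFieldVectors(VectorSchemaRoot root, int schemaSize) {
        List<FieldVector> fieldVectors = root.getFieldVectors();
        if (fieldVectors.size() < schemaSize) {
            // Fewer Arrow fields than schema columns is a genuine error.
            throw new IllegalStateException(
                    "Load Doris data failed, schema size of fetch data is wrong");
        }
        if (fieldVectors.size() > schemaSize) {
            // Doris 2.0+ may append internal columns; warn and continue.
            System.err.printf("Arrow field size %d > schema size %d;"
                            + " ignoring extra internal columns%n",
                    fieldVectors.size(), schemaSize);
        }
        return fieldVectors.subList(0, schemaSize);
    }
}
```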
### Changes Made
**File**:
`spark-doris-connector-base/src/main/java/org/apache/doris/spark/client/read/RowBatch.java`
1. `readBatch()` method:
   - Reversed the validation so an exception is thrown only when there are
fewer fields than schema columns
   - Log a warning instead of throwing when the field count exceeds the schema
size
   - Use `schema.size()` instead of `fieldVectors.size()` to initialize Row
objects
2. `convertArrowToRowBatch()` method:
   - The column loop iterates over only `schema.size()` fields, ignoring extra
internal columns (sketched below)
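
For illustration, a hypothetical simplification of the bounded conversion
loop. The actual method dispatches on each Doris column type when filling
rows; the generic `ValueVector#getObject` call below stands in for that
per-type conversion, so treat this as a shape sketch only:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Simplified stand-in for convertArrowToRowBatch(): rows are sized to the
// schema and the column loop is bounded by schemaSize, so trailing internal
// columns are never read.
final class BoundedConversionSketch {

    static List<List<Object>> convert(VectorSchemaRoot root, int schemaSize) {
        List<FieldVector> fieldVectors = root.getFieldVectors();
        int rowCount = root.getRowCount();
        List<List<Object>> rows = new ArrayList<>(rowCount);
        for (int r = 0; r < rowCount; r++) {
            // schema.size() columns per row, not fieldVectors.size()
            rows.add(new ArrayList<>(schemaSize));
        }
        for (int col = 0; col < schemaSize; col++) { // user columns only
            FieldVector vector = fieldVectors.get(col);
            for (int r = 0; r < rowCount; r++) {
                rows.get(r).add(vector.getObject(r));
            }
        }
        return rows;
    }
}
```

The key point in both sketches is that every index into `fieldVectors` is
bounded by the schema width, so trailing internal columns can never cause an
out-of-range read or a spurious mismatch error.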
## Checklist (Required)
1. Does it affect the original behavior: **Yes** - Fixes the issue where
Doris 2.0+ data cannot be read, making the connector compatible with Arrow data
streams containing internal columns
2. Have unit tests been added: **No need** - This is a fix for existing
logic; existing tests cover the core functionality
3. Has documentation been added or modified: **No need** - This is an internal
implementation fix that doesn't affect user-facing APIs
4. Does it need to update dependencies: **No** - No dependency changes
5. Are there any changes that cannot be rolled back: **No** - Can be safely
rolled back
## Further comments
### Testing & Verification
This fix has been verified in the following environment:
- **Doris Version**: 2.0.x
- **Spark Version**: 3.3
- **Table Type**: Unique Key tables (containing `__DORIS_DELETE_SIGN__`
internal column)
### Impact Scope
- **Scenarios that benefit**: All read operations against Doris 2.0+ tables
that contain internal system columns
- **Backward compatibility**: Fully compatible with older Doris versions, no
impact on existing functionality
- **Performance impact**: None; only the field-processing logic was adjusted
### Related Issues
This issue has been reported in the community:
- Issue #349: Reports the same field count mismatch problem caused by
`__DORIS_DELETE_SIGN__`
- Error message: `org.apache.doris.spark.exception.DorisException: Load
Doris data failed, schema size of fetch data is wrong`