dingyufei615 opened a new pull request, #351:
URL: https://github.com/apache/doris-spark-connector/pull/351

   # [Fix] Fix Arrow field count mismatch with schema causing read failures
   
   # Proposed changes
   
   Issue Number: close #349
   
   ## Problem Summary:
   
   Fixed the `DorisException: Load Doris data failed, schema size of fetch data is wrong` error that occurs when reading from Doris 2.0+ with the Spark Doris Connector, caused by the Arrow data stream returning more fields than the schema defines.
   
   ### Root Cause
   
   Doris 2.0+ includes internal system columns (such as `__DORIS_DELETE_SIGN__`) in the Arrow data stream for certain table types (e.g., Unique Key tables). These columns support the Merge-on-Read implementation and should not be visible to users. The original strict validation `fieldVectors.size() > schema.size()` threw an exception as soon as extra fields appeared, preventing normal data reading.
   
   ### Solution
   
   1. **Modified validation logic**: Changed the strict `>` check to throw an exception only when `fieldVectors.size() < schema.size()`, the genuine error case (see the sketch below)
   2. **Tolerate internal columns**: Log a warning instead of throwing an exception when `fieldVectors.size() > schema.size()`
   3. **Process only user columns**: In both `readBatch()` and `convertArrowToRowBatch()`, process only the columns defined in the schema, ignoring extra internal columns
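   
   A minimal sketch of the relaxed check, shown as a standalone helper for clarity; the actual change lives in `RowBatch.readBatch()` and uses the connector's own `Schema` and `DorisException` types, which are simplified away here:
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   // Standalone illustration only; not the connector's actual class.
   public final class ArrowFieldCountCheck {
       private static final Logger LOG = LoggerFactory.getLogger(ArrowFieldCountCheck.class);
   
       /** Throws only when Arrow returns fewer fields than the user-visible schema. */
       public static void validate(int arrowFieldCount, int schemaSize) throws Exception {
           if (arrowFieldCount < schemaSize) {
               // Fewer Arrow fields than schema columns is a genuine error.
               throw new Exception("Load Doris data failed, schema size of fetch data is wrong.");
           }
           if (arrowFieldCount > schemaSize) {
               // Extra fields are internal columns such as __DORIS_DELETE_SIGN__: warn and continue.
               LOG.warn("Arrow returned {} fields but schema defines {}; extra internal columns will be ignored.",
                       arrowFieldCount, schemaSize);
           }
       }
   }
   ```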
   
   ### Changes Made
   
   **File**: 
`spark-doris-connector-base/src/main/java/org/apache/doris/spark/client/read/RowBatch.java`
   
   1. `readBatch()` method:
      - Reversed the validation logic so an exception is thrown only when fields are insufficient
      - Log a warning instead of throwing an exception when the field count exceeds the schema size
      - Use `schema.size()` instead of `fieldVectors.size()` to initialize Row objects
   
   2. `convertArrowToRowBatch()` method:
      - Loop over only `schema.size()` fields, ignoring extra internal columns (see the sketch after this list)
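   
   A minimal sketch of the column loop, assuming Arrow's `VectorSchemaRoot` as input; the `rows` buffer and the direct `getObject(...)` call are simplifications standing in for the connector's Row objects and per-type value conversion:
   
   ```java
   import java.util.List;
   
   import org.apache.arrow.vector.FieldVector;
   import org.apache.arrow.vector.VectorSchemaRoot;
   
   // Hypothetical helper, not the connector's actual method: copies only the first
   // schemaSize columns, so trailing internal columns in the Arrow batch are skipped.
   public final class ConvertOnlySchemaColumns {
       public static void convert(VectorSchemaRoot root, int schemaSize, List<Object[]> rows) {
           List<FieldVector> fieldVectors = root.getFieldVectors();
           int rowCount = root.getRowCount();
           for (int col = 0; col < schemaSize; col++) {          // not fieldVectors.size()
               FieldVector vector = fieldVectors.get(col);
               for (int row = 0; row < rowCount; row++) {
                   rows.get(row)[col] = vector.getObject(row);   // simplified value conversion
               }
           }
       }
   }
   ```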
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: **Yes** - Fixes the issue where 
Doris 2.0+ data cannot be read, making the connector compatible with Arrow data 
streams containing internal columns
   2. Has unit tests been added: **No Need** - This is a fix for existing 
logic; existing tests cover core functionality
   3. Has document been added or modified: **No Need** - This is an internal 
implementation fix that doesn't affect user-facing APIs
   4. Does it need to update dependencies: **No** - No dependency changes
   5. Are there any changes that cannot be rolled back: **No** - Can be safely 
rolled back
   
   ## Further comments
   
   ### Testing & Verification
   
   This fix has been verified in the following environment (a minimal read example follows the list):
   - **Doris Version**: 2.0.x
   - **Spark Version**: 3.3
   - **Table Type**: Unique Key tables (containing `__DORIS_DELETE_SIGN__` 
internal column)
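   
   For context, a read of such a table previously failed with the error above and succeeds after this fix. The following is a hypothetical example; the table identifier, connection details, and option names are placeholders and should be checked against the connector documentation for your version:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   // Hypothetical read of a Unique Key table on Doris 2.0+.
   public class ReadUniqueKeyTable {
       public static void main(String[] args) {
           SparkSession spark = SparkSession.builder().appName("doris-read").getOrCreate();
           Dataset<Row> df = spark.read()
                   .format("doris")
                   .option("doris.fenodes", "fe_host:8030")               // placeholder FE address
                   .option("doris.table.identifier", "example_db.uk_tbl") // placeholder table
                   .option("user", "root")
                   .option("password", "")
                   .load();
           df.show();
       }
   }
   ```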
   
   ### Impact Scope
   
   - **Benefited scenarios**: All read operations using Doris 2.0+ with tables 
containing internal system columns
   - **Backward compatibility**: Fully compatible with older Doris versions, no 
impact on existing functionality
   - **Performance impact**: No performance impact, only adjusted field 
processing logic
   
   ### Related Issues
   
   This issue has been reported in the community:
   - Issue #349: Reports the same field count mismatch problem caused by 
`__DORIS_DELETE_SIGN__`
   - Error message: `org.apache.doris.spark.exception.DorisException: Load 
Doris data failed, schema size of fetch data is wrong`
   

