zclllyybb commented on issue #64334:
URL: https://github.com/apache/doris/issues/64334#issuecomment-4661384771

   Initial code-path analysis:
   
   This looks like a real BE-side bug in the non-strict CHAR/VARCHAR truncation 
path, not a Kafka or JSON parsing issue.
   
   The reported build hash maps to `4.1.0-rc03`. In routine load, 
`strict_mode=false` is propagated into `enable_insert_strict=false` in 
`be/src/load/routine_load/routine_load_task_executor.cpp`, so the non-strict 
truncation branch in `be/src/exec/sink/vtablet_block_convertor.cpp` is expected 
to run.
   
   The problematic path is:
   
   1. `OlapTableBlockConvertor::_internal_validate_column()` checks the byte 
length of string columns against the schema length.
   2. In non-strict mode, it tries to truncate over-length string values by 
executing `substring(value, 1, schema_len)`.
   3. The same function then validates the result again by byte length.
   
   The mismatch is that the schema length is measured in bytes, while 
`substring` counts UTF-8 characters on the non-ASCII path. A comment in 
`vtablet_block_convertor.cpp` already hints at this: schema length is 
byte-based, while `substring` works in character units.
   
   That explains both examples:
   
   - `中123456789012345678901234567890` is 31 characters but 33 UTF-8 bytes 
(3 bytes for `中` plus 30 ASCII digits). `substring(value, 1, 32)` keeps the 
whole value, so the later byte-length check still sees 33 bytes for 
`VARCHAR(32)` and rejects it.
   - The `U+0131` example similarly has a 32-character prefix that is 33 UTF-8 
bytes, so it fails after the attempted truncation.
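
   The arithmetic behind both examples can be reproduced outside Doris. The 
following is a minimal standalone sketch, not the actual BE code: 
`utf8_char_count`, `char_prefix`, `fits_schema`, and `check_reported_examples` 
are hypothetical helper names, the input is assumed to be valid UTF-8, and the 
`U+0131` string is a made-up stand-in for the value in the report.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8_char_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Character-based prefix: keep at most max_chars code points. This mimics
// what a character-counting substring(value, 1, max_chars) would return.
std::string char_prefix(const std::string& s, std::size_t max_chars) {
    std::size_t chars = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {          // start of a new code point
            if (chars == max_chars) break;  // prefix is already full
            ++chars;
        }
        ++i;
    }
    return s.substr(0, i);
}

// Byte-based check, as applied against the schema length.
bool fits_schema(const std::string& s, std::size_t schema_len_bytes) {
    return s.size() <= schema_len_bytes;
}

// Walks through the two cases described above for VARCHAR(32).
void check_reported_examples() {
    // U+4E2D (3 bytes) followed by 30 ASCII digits.
    const std::string zh = "\xE4\xB8\xAD" "123456789012345678901234567890";
    assert(utf8_char_count(zh) == 31);
    assert(zh.size() == 33);
    // Character-based truncation keeps everything, so the recheck fails.
    assert(!fits_schema(char_prefix(zh, 32), 32));

    // Hypothetical value: U+0131 (2 bytes) followed by 40 ASCII letters.
    const std::string tr = "\xC4\xB1" + std::string(40, 'a');
    const std::string cut = char_prefix(tr, 32);
    assert(utf8_char_count(cut) == 32);
    assert(cut.size() == 33);
    assert(!fits_schema(cut, 32));
}
```

   Calling `check_reported_examples()` from a test or `main` exercises both 
cases: in each one the character-based prefix is still 33 bytes long, which is 
why the second byte-length validation rejects the row.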
   
   So the current behavior is internally consistent with the code, but it is 
not the intended non-strict load behavior. `strict_mode=false` cannot work 
around it because the row is still rechecked after the character-based 
truncation.
   
   Suggested next steps:
   
   1. Fix the non-strict load truncation in `OlapTableBlockConvertor` to trim 
to a valid UTF-8 prefix whose byte length is no larger than the target 
CHAR/VARCHAR byte length, instead of using character-count `substring`.
   2. Cover both `TYPE_VARCHAR` and `TYPE_CHAR`, since they share this 
validation branch.
   3. Add a regression case with a non-strict load into `VARCHAR(32)` using 
both a 2-byte character such as `U+0131` and a 3-byte Chinese character. A 
focused BE test around `OlapTableBlockConvertor` would catch the exact logic; 
an end-to-end stream/routine load regression would also be useful.
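
   One possible shape for the byte-bounded trim in step 1 is sketched below. 
This is an illustration under assumptions, not the proposed patch: 
`utf8_prefix_by_bytes` and `check_utf8_prefix_by_bytes` are hypothetical names, 
the sketch works on `std::string` rather than Doris column types, and it 
assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the longest prefix of `s` that is at most `max_bytes` long and does
// not end in the middle of a UTF-8 sequence. Assumes valid UTF-8 input.
std::string utf8_prefix_by_bytes(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // s[end] is the first byte that would be dropped. If it is a continuation
    // byte (10xxxxxx), the cut falls inside a character, so back up to that
    // character's lead byte and drop the whole character.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}

// Small self-check using VARCHAR(32)-style limits.
void check_utf8_prefix_by_bytes() {
    // U+4E2D (3 bytes) + 30 digits = 33 bytes; the cut lands on an ASCII
    // boundary, so exactly 32 bytes survive.
    const std::string zh = "\xE4\xB8\xAD" "123456789012345678901234567890";
    assert(utf8_prefix_by_bytes(zh, 32).size() == 32);

    // 31 ASCII bytes, then U+0131 (2 bytes) straddling the 32-byte limit:
    // the whole 2-byte character is dropped, leaving 31 bytes.
    const std::string tr = std::string(31, 'a') + "\xC4\xB1" + "tail";
    assert(utf8_prefix_by_bytes(tr, 32) == std::string(31, 'a'));
}
```

   The result is always within the byte limit, so a subsequent byte-length 
validation would accept it. The trade-off is that the kept prefix can be up to 
three bytes shorter than the limit when a multi-byte character straddles the 
boundary.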
   
   Temporary workaround before a fix: ensure the producer sends values already 
truncated to a UTF-8-safe 32-byte prefix, or increase the target `VARCHAR` 
length. Raising `max_filter_ratio` or keeping `strict_mode=false` will not make 
this specific row load successfully because the post-truncation byte-length 
validation still rejects it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

