zclllyybb commented on issue #64334: URL: https://github.com/apache/doris/issues/64334#issuecomment-4661384771
Breakwater-GitHub-Analysis-Slot: slot_ee38da6ce046

Initial code-path analysis:

This looks like a real BE-side bug in the non-strict CHAR/VARCHAR truncation path, not a Kafka or JSON parsing issue.

The reported build hash maps to `4.1.0-rc03`. In routine load, `strict_mode=false` is propagated into `enable_insert_strict=false` in `be/src/load/routine_load/routine_load_task_executor.cpp`, so the non-strict truncation branch in `be/src/exec/sink/vtablet_block_convertor.cpp` is expected to run.

The problematic path is:

1. `OlapTableBlockConvertor::_internal_validate_column()` checks the byte length of string columns against the schema length.
2. In non-strict mode, it tries to truncate over-length string values by executing `substring(value, 1, schema_len)`.
3. The same function then validates the result again by byte length.

The mismatch is that the schema length is treated as bytes, while `substring` counts UTF-8 characters on the non-ASCII path. This is already hinted by the comment in `vtablet_block_convertor.cpp`: schema length is byte-based, while `substring` works by character units.

That explains both examples:

- `中123456789012345678901234567890` is 31 characters but 33 UTF-8 bytes. `substring(..., 32)` keeps the whole value, so the later byte-length check still sees 33 bytes for `VARCHAR(32)` and rejects it.
- The `U+0131` example similarly has a 32-character prefix that is 33 UTF-8 bytes, so it fails after the attempted truncation.

So the current behavior is internally consistent with the code, but it is not the intended non-strict load behavior. `strict_mode=false` cannot work around it because the row is still rechecked after the character-based truncation.

Suggested next steps:

1. Fix the non-strict load truncation in `OlapTableBlockConvertor` to trim to a valid UTF-8 prefix whose byte length is no larger than the target CHAR/VARCHAR byte length, instead of using character-count `substring`.
2. Cover both `TYPE_VARCHAR` and `TYPE_CHAR`, since they share this validation branch.
3. Add a regression case with a non-strict load into `VARCHAR(32)` using both a 2-byte character such as `U+0131` and a 3-byte Chinese character. A focused BE test around `OlapTableBlockConvertor` would catch the exact logic; an end-to-end stream/routine load regression would also be useful.

Temporary workaround before a fix: ensure the producer sends values already truncated to a UTF-8-safe 32-byte prefix, or increase the target `VARCHAR` length. Raising `max_filter_ratio` or keeping `strict_mode=false` will not make this specific row load successfully because the post-truncation byte-length validation still rejects it.
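
To make the byte-versus-character mismatch concrete, here is a minimal standalone C++ sketch. It is not the Doris implementation: `utf8_char_prefix` and `utf8_byte_prefix` are hypothetical helper names, the first only models a character-counting truncation like the one described above, and the second shows one possible way to trim to a UTF-8-safe byte prefix. Both assume the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical model of a character-counting truncation: keeps at most
// `max_chars` code points, regardless of how many bytes they occupy.
inline std::string_view utf8_char_prefix(std::string_view s, std::size_t max_chars) {
    std::size_t chars = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        // A non-continuation byte starts a new code point.
        if (!is_utf8_continuation(static_cast<unsigned char>(s[i]))) {
            if (chars == max_chars) break;  // next code point would exceed the limit
            ++chars;
        }
        ++i;
    }
    return s.substr(0, i);
}

// Hypothetical byte-based truncation: returns the longest prefix of at most
// `max_bytes` bytes that does not split a code point (assumes valid UTF-8).
inline std::string_view utf8_byte_prefix(std::string_view s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // s[cut] is the first excluded byte; if it is a continuation byte, the cut
    // falls inside a code point, so back up to that code point's lead byte.
    while (cut > 0 && is_utf8_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}
```

For the first reported value (one 3-byte character followed by 30 ASCII digits), the character-based prefix with a limit of 32 keeps all 33 bytes, while the byte-based prefix keeps exactly 32. For an input of 31 ASCII bytes followed by `U+0131` (an illustrative value, not the exact string from the issue), the byte-based prefix keeps 31 bytes, because the 2-byte character would otherwise be split at the 32-byte boundary. Either result would pass a subsequent byte-length check against `VARCHAR(32)`.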
