zclllyybb commented on issue #64806:
URL: https://github.com/apache/doris/issues/64806#issuecomment-4795269121

   Breakwater-GitHub-Analysis-Slot: slot_bec3a9c3488a
   This content is generated by AI for reference only.
   
   Initial triage: this looks like a real Doris bug in the multi-table CDC 
write path, not a MySQL table charset problem and probably not an auto-created 
Doris DDL problem.
   
   I checked the 4.1.1 code path against the local `4.1.1-rc01` tag. The 
auto-create path builds target columns from JDBC metadata in 
`StreamingJobUtils.generateCreateTableCmds()`, and `getColumns()` already 
expands MySQL `varchar`/`char` length by 3 for UTF-8 byte width. That matches 
the report that `SHOW CREATE TABLE` does not show a meaningful DDL difference.
   
   The important difference is after deserialization:
   
   - TVF mode: `PipelineCoordinator.buildStreamRecords()` writes each JSON 
record with `record.getBytes(StandardCharsets.UTF_8)`.
   - Multi-table CDC mode: `PipelineCoordinator.writeRecords()` sends records 
to `DorisBatchStreamLoad` with `record.getBytes()`.
   
   `String.getBytes()` without an explicit charset uses the CDC client JVM 
default charset. On JDK 17 and earlier that default is derived from the process 
locale (`file.encoding`), so it is not guaranteed to be UTF-8. If it is not 
UTF-8, Chinese characters that the default charset cannot represent are 
replaced by literal `?` bytes before Stream Load receives the payload. This is 
consistent with the symptom where Chinese text becomes `???`, and it also 
explains why adding `useUnicode=true&characterEncoding=utf-8` to the MySQL 
`jdbc_url` does not help: the loss happens later, while converting the 
deserialized JSON `String` to bytes for the multi-table Stream Load path. The 
Stream Load content type is already `application/json;charset=UTF-8`, but by 
that point the bytes may already contain `?`.
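
The mechanism described above can be reproduced in isolation. The following is a minimal sketch, not Doris code: the class `CharsetLossDemo` and its helpers are hypothetical, and ISO-8859-1 is passed explicitly only to simulate a non-UTF-8 JVM default. It encodes a JSON record containing Chinese text and then decodes the bytes as UTF-8, as a UTF-8 receiver would.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLossDemo {
    /** Encodes with the given charset, then decodes as UTF-8 like a UTF-8 receiver. */
    static String roundTrip(String record, Charset encodeWith) {
        byte[] payload = record.getBytes(encodeWith);
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Upper-case hex of the encoded bytes, comparable to SQL HEX(col). */
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u4e2d\u6587 is the two-character Chinese string for "Chinese text";
        // escapes keep this file independent of the compiler's source encoding.
        String text = "\u4e2d\u6587";
        String record = "{\"name\":\"" + text + "\"}";

        // UTF-8 encoding survives the round trip unchanged.
        System.out.println(roundTrip(record, StandardCharsets.UTF_8).equals(record)); // true

        // A charset that cannot represent the characters substitutes '?' (0x3F).
        System.out.println(roundTrip(record, StandardCharsets.ISO_8859_1)); // {"name":"??"}

        System.out.println(hex(text, StandardCharsets.UTF_8));      // E4B8ADE69687
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // 3F3F
    }
}
```

The `3F` bytes are what a `HEX(col)` query on the Doris side would show if the data was lost in this way, which is why the original characters cannot be recovered after the load.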
   
   The same default-charset call still exists on the local `upstream/master` 
ref, so the issue does not appear to be limited to the 4.1.1 release branch.
   
   Suggested fix:
   
   1. Change the multi-table CDC write path to encode records explicitly with 
UTF-8, for example:
   
   ```java
   // Requires: import java.nio.charset.StandardCharsets;
   batchStreamLoad.writeRecord(
           targetDb, dorisTable, record.getBytes(StandardCharsets.UTF_8));
   ```
   
   2. Add a focused cdc-client test that writes a JSON record containing 
Chinese text through the multi-table `writeRecords`/`DorisBatchStreamLoad` 
path, preferably with a non-UTF-8 JVM default charset or by directly asserting 
the produced bytes.
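
A byte-level assertion of the kind suggested in item 2 could look roughly like the sketch below. It assumes nothing about the real cdc-client classes: `RecordSink`, `writeRecords`, and `capture` are hypothetical stand-ins that only mirror the three-argument `writeRecord(db, table, bytes)` call shown above, and a real test would exercise the actual `PipelineCoordinator`/`DorisBatchStreamLoad` code instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Utf8WritePathCheck {
    /** Hypothetical stand-in for the sink; not the real DorisBatchStreamLoad API. */
    interface RecordSink {
        void writeRecord(String db, String table, byte[] record);
    }

    /** Hypothetical stand-in for the fixed write path: always encodes as UTF-8. */
    static void writeRecords(RecordSink sink, String db, String table, List<String> records) {
        for (String record : records) {
            sink.writeRecord(db, table, record.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Sends one record through the write path and returns the bytes the sink received. */
    static byte[] capture(String record) {
        List<byte[]> seen = new ArrayList<>();
        writeRecords((db, table, bytes) -> seen.add(bytes), "db", "tbl",
                Collections.singletonList(record));
        return seen.get(0);
    }

    /** True if any byte is a literal '?' (0x3F), the usual replacement byte. */
    static boolean containsQuestionMark(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0x3F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String record = "{\"name\":\"\u4e2d\u6587\"}";
        byte[] actual = capture(record);

        // The payload must decode back to the original JSON as UTF-8...
        if (!new String(actual, StandardCharsets.UTF_8).equals(record)) {
            throw new AssertionError("payload is not the UTF-8 encoding of the record");
        }
        // ...and must not contain replacement bytes (the record itself has no '?').
        if (containsQuestionMark(actual)) {
            throw new AssertionError("payload contains '?' replacement bytes");
        }
        System.out.println("ok");
    }
}
```

Because the assertion is on the bytes handed to the sink, the check does not depend on how the test JVM's default charset happens to be configured.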
   
   Useful confirmation data from the reporter, if maintainers want to verify 
the environment:
   
   - BE/cdc-client logs for the affected streaming job and task.
   - The exact `CREATE STREAMING JOB` statement with credentials removed.
   - MySQL `SHOW VARIABLES LIKE 'character_set%';` and `SHOW VARIABLES LIKE 
'collation%';`.
   - A source-side sample such as `SELECT col, HEX(col) ...` and the 
corresponding Doris `SELECT col, HEX(col) ...`.
   - The CDC client JVM/default locale information if available, especially 
`file.encoding`, `LANG`, and `LC_ALL`.
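
For the last item, a small diagnostic such as the following sketch can report the relevant values. The class name `JvmCharsetReport` is hypothetical, and the output is only meaningful when the program is started with the same JDK, JVM options, and environment variables as the cdc-client process.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class JvmCharsetReport {
    /** Collects the charset-related settings; missing values are reported as "null". */
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<>();
        // The charset that String.getBytes() uses when no charset is passed.
        info.put("defaultCharset", Charset.defaultCharset().name());
        info.put("java.version", String.valueOf(System.getProperty("java.version")));
        info.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        info.put("LANG", String.valueOf(System.getenv("LANG")));
        info.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

If `defaultCharset` is anything other than `UTF-8`, the `record.getBytes()` call in the multi-table path is a plausible source of the `?` bytes.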
   
   Temporary mitigation, if a patch cannot be applied immediately: ensure the 
cdc-client JVM runs with UTF-8 as its default charset, for example by starting 
it with `-Dfile.encoding=UTF-8`. The code should still be fixed so that it does 
not depend on the process locale.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

