zclllyybb commented on issue #64806: URL: https://github.com/apache/doris/issues/64806#issuecomment-4795269121
Breakwater-GitHub-Analysis-Slot: slot_bec3a9c3488a

This content is generated by AI for reference only.

Initial triage: this looks like a real Doris bug in the multi-table CDC write path, not a MySQL table charset problem and probably not an auto-created Doris DDL problem.

I checked the 4.1.1 code path against the local `4.1.1-rc01` tag. The auto-create path builds target columns from JDBC metadata in `StreamingJobUtils.generateCreateTableCmds()`, and `getColumns()` already expands MySQL `varchar`/`char` length by 3 for UTF-8 byte width. That matches the report that `SHOW CREATE TABLE` does not show a meaningful DDL difference.

The important difference is after deserialization:

- TVF mode: `PipelineCoordinator.buildStreamRecords()` writes each JSON record with `record.getBytes(StandardCharsets.UTF_8)`.
- Multi-table CDC mode: `PipelineCoordinator.writeRecords()` sends records to `DorisBatchStreamLoad` with `record.getBytes()`.

`String.getBytes()` without an explicit charset uses the CDC client JVM default charset. If that JVM default is not UTF-8, Chinese characters can be replaced by literal `?` bytes before Stream Load receives the payload. This exactly matches the symptom where Chinese text becomes `???`, and also explains why adding `useUnicode=true&characterEncoding=utf-8` to the MySQL `jdbc_url` does not help: the loss happens later, while converting the deserialized JSON `String` to bytes for the multi-table Stream Load path. The Stream Load content type is already `application/json;charset=UTF-8`, but by that point the bytes may already contain `?`.

The same default-charset call still exists on the local `upstream/master` ref, so this does not look limited to only the 4.1.1 release branch.

Suggested fix:

1. Change the multi-table CDC write path to encode records explicitly with UTF-8, for example:

   ```java
   batchStreamLoad.writeRecord(targetDb, dorisTable, record.getBytes(StandardCharsets.UTF_8));
   ```

2. Add a focused cdc-client test that writes a JSON record containing Chinese text through the multi-table `writeRecords`/`DorisBatchStreamLoad` path, preferably with a non-UTF-8 JVM default charset or by directly asserting the produced bytes.

Useful confirmation data from the reporter, if maintainers want to verify the environment:

- BE/cdc-client logs for the affected streaming job and task.
- The exact `CREATE STREAMING JOB` statement with credentials removed.
- MySQL `SHOW VARIABLES LIKE 'character_set%';` and `SHOW VARIABLES LIKE 'collation%';`.
- A source-side sample such as `SELECT col, HEX(col) ...` and the corresponding Doris `SELECT col, HEX(col) ...`.
- The CDC client JVM/default locale information if available, especially `file.encoding`, `LANG`, and `LC_ALL`.

Temporary mitigation, if a patch cannot be applied immediately: ensure the cdc-client JVM runs with UTF-8 as its default charset, but the code should still be fixed to avoid depending on process locale.
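
To make the failure mode described above concrete, here is a minimal standalone sketch in plain Java. It is not Doris code: the class `CharsetLossDemo` and its methods are hypothetical names for illustration only. It assumes ISO-8859-1 is a reasonable stand-in for a non-UTF-8 JVM default charset, and passes that charset explicitly so the result does not depend on the machine running it. The Chinese text is written with Unicode escapes so the example also does not depend on source-file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only; names here are hypothetical and not part of Doris.
class CharsetLossDemo {

    // Same shape as the suggested fix: the charset is always explicit.
    static byte[] encodeUtf8(String record) {
        return record.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for record.getBytes() when the JVM default is `charset`.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement bytes, which is '?' for ISO-8859-1.
    static byte[] encodeWith(String record, Charset charset) {
        return record.getBytes(charset);
    }

    public static void main(String[] args) {
        // JSON record containing two Chinese characters (U+4E2D, U+6587).
        String record = "{\"name\":\"\u4e2d\u6587\"}";

        byte[] utf8 = encodeUtf8(record);
        byte[] latin1 = encodeWith(record, StandardCharsets.ISO_8859_1);

        // Reports what record.getBytes() would use on this JVM.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));

        // UTF-8 bytes decode back to the original record.
        System.out.println("UTF-8 round trip ok: "
                + record.equals(new String(utf8, StandardCharsets.UTF_8)));

        // The Chinese characters are already lost at this point.
        System.out.println("ISO-8859-1 payload: "
                + new String(latin1, StandardCharsets.ISO_8859_1)); // {"name":"??"}
    }
}
```

The byte-level check in `encodeWith` is also roughly what the proposed cdc-client test could assert on directly, without needing to restart the JVM under a different locale. For the temporary mitigation, one common way to force the default is the `-Dfile.encoding=UTF-8` JVM option; how the cdc-client process is launched is not covered in this comment, so where to set it is left open.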
