baishaoisde opened a new issue, #92: URL: https://github.com/apache/doris-spark-connector/issues/92
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues.

### Description

If the amount of data in a partition exceeds INSERT_BATCH_SIZE, each task submits multiple StreamLoad requests. If the task fails and is retried, all of the partition's data is resubmitted via StreamLoad, including the data that was already written successfully, so the data is duplicated.

### Solution

My suggestion is to add a parameter that, when enabled, forces each partition to submit exactly one StreamLoad, guaranteeing that data is never committed twice.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
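A minimal Scala sketch of the failure mode and of the proposed single-load-per-partition behaviour. This is not the connector's actual code; `streamLoad`, `writePartition`, `writePartitionOnce`, and `batchSize` are hypothetical placeholders standing in for the connector's writer internals:

```scala
import scala.collection.mutable.ArrayBuffer

object StreamLoadSketch {
  // Placeholder for one HTTP Stream Load request; assumed atomic per call.
  def streamLoad(rows: Seq[String]): Unit =
    println(s"stream load of ${rows.size} rows")

  // Current behaviour (simplified): flush every `batchSize` rows.
  // If the task fails after some flushes, Spark retries the WHOLE
  // partition, so the already-committed batches are loaded again.
  def writePartition(rows: Iterator[String], batchSize: Int): Unit = {
    val buffer = ArrayBuffer.empty[String]
    for (row <- rows) {
      buffer += row
      if (buffer.size >= batchSize) {
        streamLoad(buffer.toSeq) // committed; a later retry duplicates this
        buffer.clear()
      }
    }
    if (buffer.nonEmpty) streamLoad(buffer.toSeq)
  }

  // Proposed behaviour: buffer the whole partition and commit exactly once,
  // so a retry replays a load that never succeeded, not one that did.
  def writePartitionOnce(rows: Iterator[String]): Unit = {
    val all = rows.toSeq
    if (all.nonEmpty) streamLoad(all)
  }
}
```

With such a flag enabled, a retried task can only replay a StreamLoad that never succeeded; the trade-off is that the entire partition must be buffered in executor memory before the single load is issued.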
