xy720 opened a new issue, #33919: URL: https://github.com/apache/doris/issues/33919
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

Only in branch-2.0

### What's Wrong?

Data loss and incorrect data may occur when the table uses random distribution.

### What You Expected?

Data should remain correct and complete.

### How to Reproduce?

**There are two problems with random distribution tables:**

### 1. Incorrect data

1.1. Prepare a random distribution table:

```
CREATE TABLE `test1` (
  `ddate` DATE NULL COMMENT 'date field',
  `dhour` VARCHAR(*) NULL COMMENT 'hour field',
  `server_time` DATETIME NULL COMMENT 'time the log landed on the server',
  `log_type` VARCHAR(*) NULL COMMENT 'log type',
  `source_flag` VARCHAR(*) NULL COMMENT 'source identifier',
  `a` VARCHAR(*) NULL COMMENT 'array',
  `ext_json` VARCHAR(*) NULL COMMENT 'extended JSON'
) ENGINE=OLAP
DUPLICATE KEY(`ddate`)
COMMENT 'OLAP'
PARTITION BY RANGE(`ddate`)
(PARTITION p202403 VALUES [('2024-03-01'), ('2024-04-01')),
PARTITION p202404 VALUES [('2024-04-01'), ('2024-05-01')))
DISTRIBUTED BY RANDOM BUCKETS 16
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1",
  "min_load_replica_num" = "-1",
  "is_being_synced" = "false",
  "inverted_index_storage_format" = "V1",
  "light_schema_change" = "true",
  "disable_auto_compaction" = "false"
);
```

1.2. Load some data.

1.3. Load results:

```
MySQL [test]> select ddate,count(1) from test1 group by ddate order by ddate;
+------------+----------+
| ddate      | count(1) |
+------------+----------+
| 2024-03-27 |      357 |
| 2024-03-28 |     3185 |
| 2024-03-29 |     2266 |
| 2024-04-01 |    20612 |
| 2024-04-02 |    16488 |
| 2024-04-03 |    12204 |
| 2024-04-04 |    38795 |
| 2024-04-05 |    43487 |
| 2024-04-06 |    52783 |
| 2024-04-07 |    31185 |
| 2024-04-08 |    22138 |
+------------+----------+
11 rows in set (0.04 sec)
```

1.4. Drop the partition p202404:

```
MySQL [test]> alter table test1 drop partition p202404;
Query OK, 0 rows affected (0.01 sec)
```

1.5. Now you can see the data is incorrect (the data for April, which should have been removed along with p202404, still exists):
```
MySQL [test]> select ddate,count(1) from test1 group by ddate order by ddate;
+------------+----------+
| ddate      | count(1) |
+------------+----------+
| 2024-03-28 |      208 |
| 2024-03-29 |      631 |
| 2024-04-01 |     1891 |
| 2024-04-02 |      720 |
| 2024-04-03 |      881 |
| 2024-04-04 |     2191 |
| 2024-04-05 |     3075 |
| 2024-04-06 |      698 |
| 2024-04-07 |      980 |
| 2024-04-08 |      917 |
+------------+----------+
10 rows in set (0.02 sec)
```

### 2. Data loss

2.1. Prepare a random distribution table:

```
CREATE TABLE `test1` (
  `ddate` DATE NULL COMMENT 'date field',
  `dhour` VARCHAR(*) NULL COMMENT 'hour field',
  `server_time` DATETIME NULL COMMENT 'time the log landed on the server',
  `log_type` VARCHAR(*) NULL COMMENT 'log type',
  `source_flag` VARCHAR(*) NULL COMMENT 'source identifier',
  `a` VARCHAR(*) NULL COMMENT 'array',
  `ext_json` VARCHAR(*) NULL COMMENT 'extended JSON'
) ENGINE=OLAP
DUPLICATE KEY(`ddate`)
COMMENT 'OLAP'
PARTITION BY RANGE(`ddate`)
(PARTITION p202403 VALUES [('2024-03-01'), ('2024-04-01')),
PARTITION p202404 VALUES [('2024-04-01'), ('2024-05-01')))
DISTRIBUTED BY RANDOM BUCKETS 16
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1",
  "min_load_replica_num" = "-1",
  "is_being_synced" = "false",
  "inverted_index_storage_format" = "V1",
  "light_schema_change" = "true",
  "disable_auto_compaction" = "false"
);
```

2.2. Load some data.

2.3. Load results:

```
MySQL [test]> select ddate,count(1) from test1 group by ddate order by ddate;
+------------+----------+
| ddate      | count(1) |
+------------+----------+
| 2024-03-27 |      357 |
| 2024-03-28 |     3185 |
| 2024-03-29 |     2266 |
| 2024-04-01 |    20612 |
| 2024-04-02 |    16488 |
| 2024-04-03 |    12204 |
| 2024-04-04 |    38795 |
| 2024-04-05 |    43487 |
| 2024-04-06 |    52783 |
| 2024-04-07 |    31185 |
| 2024-04-08 |    22138 |
+------------+----------+
11 rows in set (0.04 sec)

MySQL [test]> select count(*) from test1;
+----------+
| count(*) |
+----------+
|   243500 |
+----------+
1 row in set (0.02 sec)
```

2.4. Create a table with the same structure:

```
MySQL [test]> create table test2 like test1;
Query OK, 0 rows affected (0.01 sec)
```

2.5. Insert the data of test1 into test2:

```
MySQL [test]> insert into test2 select * from test1;
Query OK, 243500 rows affected (2.02 sec)
{'label':'label_f7cf2cad0b164cfb_ad6b846c80e5ac5a', 'status':'VISIBLE', 'txnId':'148'}
```

2.6. Now you can see that some data is lost (all of the March rows and part of the early-April rows are gone):

```
MySQL [test]> select ddate,count(1) from test2 group by ddate order by ddate;
+------------+----------+
| ddate      | count(1) |
+------------+----------+
| 2024-04-01 |     5172 |
| 2024-04-02 |     8049 |
| 2024-04-03 |     8373 |
| 2024-04-04 |    35737 |
| 2024-04-05 |    43487 |
| 2024-04-06 |    52783 |
| 2024-04-07 |    31185 |
| 2024-04-08 |    22138 |
+------------+----------+
8 rows in set (0.06 sec)

MySQL [test]> select count(*) from test2;
+----------+
| count(*) |
+----------+
|   206924 |
+----------+
1 row in set (0.05 sec)
```

### Anything Else?

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
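As an addendum, the extent of the loss in problem 2 can be quantified directly from the per-date counts reported in steps 2.3 and 2.6. The small script below is not part of the original report; it only re-tallies the numbers shown above:

```python
# Per-date row counts copied verbatim from steps 2.3 (test1) and 2.6 (test2).
test1_counts = {
    "2024-03-27": 357, "2024-03-28": 3185, "2024-03-29": 2266,
    "2024-04-01": 20612, "2024-04-02": 16488, "2024-04-03": 12204,
    "2024-04-04": 38795, "2024-04-05": 43487, "2024-04-06": 52783,
    "2024-04-07": 31185, "2024-04-08": 22138,
}
test2_counts = {
    "2024-04-01": 5172, "2024-04-02": 8049, "2024-04-03": 8373,
    "2024-04-04": 35737, "2024-04-05": 43487, "2024-04-06": 52783,
    "2024-04-07": 31185, "2024-04-08": 22138,
}

# Rows missing from test2, per date.
lost = {d: c - test2_counts.get(d, 0)
        for d, c in test1_counts.items()
        if c != test2_counts.get(d, 0)}
total_lost = sum(lost.values())

print(lost)         # all March rows and part of 2024-04-01..04 are gone
print(total_lost)   # 36576, which matches 243500 - 206924
```

The totals line up: test1 held 243500 rows, test2 ends up with 206924, so 36576 rows were silently dropped by the `insert into ... select`.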
