morrifeldman opened a new issue #11719: URL: https://github.com/apache/druid/issues/11719
### Description

On our production cluster, a native batch indexing job with 195 tasks experienced many task failures due to the simultaneous loss of many indexing nodes. We run our indexing nodes on EC2 Spot instances, and many instances were taken at the same time. The indexing job recovered and successfully reran all the failed subtasks. Unfortunately, when validating the data we found that the metric counts in the database were too high. Dissecting the discrepancy, we found a few days with completely duplicated segments -- same exact size and number of rows but different segment ids. For example, segments `ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31` and `ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33` have exactly the same size and number of rows.

Diving into the logs suggests the following sequence of events.

* `single_phase_sub_task_ltv_bpkhjbad_2021-09-15T05:17:43.593Z` pushed segment `ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31` -- [see `bpkhjbad` log](https://github.com/apache/druid/files/7180752/bpkhjbad-noGC.log)
* `bpkhjbad` finished successfully but the master thought it failed -- ` task[single_phase_sub_task_ltv_bpkhjbad_2021-09-15T05:17:43.593Z] in state[RUNNING] suddenly disappeared on worker[druidindex-29118-030-prod.eu1.appsflyer.com:8091]. failing it.`
* Then `single_phase_sub_task_ltv_adopcgba_2021-09-15T05:31:16.923Z` was launched as a rerun -- [see the supervisor log](https://github.com/apache/druid/files/7180755/index_parallel_ltv_fiiamkgi-noGC.log)
* `abopcgba` pushed `ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33` and finished successfully -- [see `abopcgba` log](https://github.com/apache/druid/files/7180754/abopcgba-noGC.log)
* Unfortunately, at the end of the job both segments were published, as seen in the master log, resulting in the duplicate segments:

```
Published segment [ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31] to DB with used flag [true], json[{"dataSource":"ltv","interval":"2021-07-26T00:00:00.000Z/2021-07-27T00:00:00.000Z","version":"2021-09-14T12:24:13.709Z",...
```

```
Published segment [ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33] to DB with used flag [true], json[{"dataSource":"ltv","interval":"2021-07-26T00:00:00.000Z/2021-07-27T00:00:00.000Z","version":"2021-09-14T12:24:13.709Z",...
```

The same sequence of events happened in other pairs of subtasks, resulting overall in 5 pairs of duplicated segments.

Attached are the logs for the [`bpkhjbad`](https://github.com/apache/druid/files/7180752/bpkhjbad-noGC.log) and [`adopcgba`](https://github.com/apache/druid/files/7180754/abopcgba-noGC.log) subtasks as well as the [supervisor index_parallel log](https://github.com/apache/druid/files/7180755/index_parallel_ltv_fiiamkgi-noGC.log).

It seems that after the `bpkhjbad` subtask was marked as failed, the segments it had pushed should have been cleaned up, or at least not published at the end of the job.

### Version

0.19.0
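
For anyone who wants to check their own datasource for the same symptom, here is a rough sketch of the duplicate check described in the report: group segments by interval, version, size and row count, then flag any group that contains more than one partition number. The `Segment` record, the helper names and the sizes and row counts in the sample are hypothetical and are not taken from the attached logs; in practice the rows could presumably be pulled from Druid's `sys.segments` table. Matching size and row count is only a heuristic, so a flagged pair still needs to be confirmed against the task logs.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical record; the field names are assumptions for illustration.
    segment_id: str
    size: int
    num_rows: int


def split_segment_id(segment_id, datasource):
    """Split '<datasource>_<start>_<end>_<version>[_<partition>]' into parts.

    The datasource is passed in explicitly because it may itself contain '_'.
    """
    prefix = datasource + "_"
    if not segment_id.startswith(prefix):
        raise ValueError(f"{segment_id!r} does not belong to {datasource!r}")
    parts = segment_id[len(prefix):].split("_")
    start, end, version = parts[0], parts[1], parts[2]
    # A missing partition suffix is treated as partition 0.
    partition = int(parts[3]) if len(parts) > 3 else 0
    return start, end, version, partition


def find_suspected_duplicates(segments, datasource):
    """Return {(start, end, version, size, num_rows): [partitions]} for
    every group holding more than one partition."""
    groups = defaultdict(list)
    for seg in segments:
        start, end, version, partition = split_segment_id(seg.segment_id, datasource)
        groups[(start, end, version, seg.size, seg.num_rows)].append(partition)
    return {key: sorted(parts) for key, parts in groups.items() if len(parts) > 1}


# Segment ids follow the pattern from the report; sizes and row counts are made up.
BASE = "ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z"
sample = [
    Segment(BASE + "_31", size=1000, num_rows=50),
    Segment(BASE + "_32", size=2000, num_rows=90),
    Segment(BASE + "_33", size=1000, num_rows=50),
]

for key, partitions in find_suspected_duplicates(sample, "ltv").items():
    print(key[0], key[1], "suspected duplicate partitions:", partitions)
```

Grouping on the version as well as the interval keeps the check scoped to a single indexing job, so segments from an earlier or later reindex of the same day are not compared against each other.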
