morrifeldman opened a new issue #11719:
URL: https://github.com/apache/druid/issues/11719


   ### Description
   
   On our production cluster, a native batch indexing job with 195 tasks 
experienced many task failures due to simultaneous loss of many indexing nodes. 
 We run our indexing nodes on EC2 Spot instances and many instances were taken 
at the same time.  The indexing job recovered and successfully reran all the 
failed sub tasks.  Unfortunately when validating the data we found the metric 
counts in the database were too high.  Dissecting the discrepancy we found a 
few days with completely duplicated segments -- same exact size and number of 
rows but different segment ids.
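
A check like the one described above can be sketched in a few lines. This is only an illustration, not what we actually ran: the field names follow the columns we believe `sys.segments` exposes (`segment_id`, `start`, `end`, `version`, `size`, `num_rows`), and the size and row values below are made up. Matching size and row count is a heuristic for flagging candidates, not proof of duplication.

```python
from collections import defaultdict


def find_suspected_duplicates(segments):
    """Group segment ids that share interval, version, size and row count."""
    groups = defaultdict(list)
    for seg in segments:
        key = (seg["start"], seg["end"], seg["version"], seg["size"], seg["num_rows"])
        groups[key].append(seg["segment_id"])
    # Only groups with more than one member are suspicious.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical rows: the ids come from this report, size/num_rows are invented.
PREFIX = "ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z"
rows = [
    {"segment_id": PREFIX + "_31", "start": "2021-07-26", "end": "2021-07-27",
     "version": "2021-09-14T12:24:13.709Z", "size": 1000, "num_rows": 10},
    {"segment_id": PREFIX + "_32", "start": "2021-07-26", "end": "2021-07-27",
     "version": "2021-09-14T12:24:13.709Z", "size": 2000, "num_rows": 20},
    {"segment_id": PREFIX + "_33", "start": "2021-07-26", "end": "2021-07-27",
     "version": "2021-09-14T12:24:13.709Z", "size": 1000, "num_rows": 10},
]

suspects = find_suspected_duplicates(rows)
for group in suspects:
    print("suspected duplicates:", group)
```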
   
   For example segments 
`ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31`
 and 
`ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33`
 have exactly the same size and number of rows.  Diving into the logs suggests 
the following sequence of events.
   
   * `single_phase_sub_task_ltv_bpkhjbad_2021-09-15T05:17:43.593Z` pushed 
segment 
`ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31`
 -- [see `bpkhjbad` 
log](https://github.com/apache/druid/files/7180752/bpkhjbad-noGC.log)
   * `bpkhjbad` finished successfully but the master thought it failed -- `
   task[single_phase_sub_task_ltv_bpkhjbad_2021-09-15T05:17:43.593Z] in 
state[RUNNING] suddenly disappeared on 
worker[druidindex-29118-030-prod.eu1.appsflyer.com:8091]. failing it.` 
   * Then `single_phase_sub_task_ltv_adopcgba_2021-09-15T05:31:16.923Z` was 
launched as a rerun -- [see the supervisor 
log](https://github.com/apache/druid/files/7180755/index_parallel_ltv_fiiamkgi-noGC.log)
   * `abopcgba` pushed 
`ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33`
 and finished successfully -- [see `abopcgba` 
log](https://github.com/apache/druid/files/7180754/abopcgba-noGC.log)
   * Unfortunately, at the end of the job both segments were published, as seen 
in the master log, resulting in the duplicate segments:
   
   ```
   Published segment 
[ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_31]
 to DB with used flag [true], 
json[{"dataSource":"ltv","interval":"2021-07-26T00:00:00.000Z/2021-07-27T00:00:00.000Z","version":"2021-09-14T12:24:13.709Z",...
   ```
   
   ```
   Published segment 
[ltv_2021-07-26T00:00:00.000Z_2021-07-27T00:00:00.000Z_2021-09-14T12:24:13.709Z_33]
 to DB with used flag [true], 
json[{"dataSource":"ltv","interval":"2021-07-26T00:00:00.000Z/2021-07-27T00:00:00.000Z","version":"2021-09-14T12:24:13.709Z",...
   ```
   
   The same sequence of events happened in another pair of subtasks, resulting 
overall in 5 pairs of duplicated segments.
   
   Attached are the logs for the 
[`bpkhjbad`](https://github.com/apache/druid/files/7180752/bpkhjbad-noGC.log) 
and 
[`adopcgba`](https://github.com/apache/druid/files/7180754/abopcgba-noGC.log) 
subtasks as well as the [supervisor index_parallel 
log](https://github.com/apache/druid/files/7180755/index_parallel_ltv_fiiamkgi-noGC.log).
   
   It seems that after the `bpkhjbad` subtask failed the segments that it 
pushed should have been cleaned up or at least not published at the end of the 
job.
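
To make the expected behavior concrete, here is a minimal sketch of the filtering I have in mind. It is not Druid's actual publishing code: the function name, the status strings, and the `rerun_of_bpkhjbad` key are all hypothetical, and it assumes the supervisor knows which segments each subtask attempt pushed and what final status it recorded for that attempt.

```python
def segments_to_publish(pushed_by_subtask, final_status):
    """Return segment ids pushed by subtasks the supervisor recorded as SUCCESS."""
    publish = []
    for subtask_id, segment_ids in pushed_by_subtask.items():
        # Segments from attempts recorded as failed (or unknown) are skipped,
        # even if that attempt managed to push before it was marked failed.
        if final_status.get(subtask_id) == "SUCCESS":
            publish.extend(segment_ids)
    return sorted(publish)


# Hypothetical view of the supervisor's state for the case in this report.
pushed = {
    "bpkhjbad": ["..._31"],           # pushed, then marked failed by the master
    "rerun_of_bpkhjbad": ["..._33"],  # rerun that finished successfully
}
status = {"bpkhjbad": "FAILED", "rerun_of_bpkhjbad": "SUCCESS"}

to_publish = segments_to_publish(pushed, status)
print(to_publish)
```

With this kind of filter only the segment from the rerun would be published; the segment pushed by the attempt that was marked failed would be left for cleanup.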
   
   ### Version 0.19.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


