jihoonson commented on issue #8061: Native parallel batch indexing with shuffle
URL: https://github.com/apache/incubator-druid/issues/8061#issuecomment-510699299

> are you planning to track access time in MM memory ? what happens if MM process gets restarted for some reason and then it will lose track of that.

Yes, this is what I'm thinking. My idea is pretty similar to yours, with a few additions.

- MM will keep expiration times in memory.
- An expiration time is initialized with `current time + configured timeout`.
- MM periodically checks whether any new partitions have been created for new supervisorTasks and initializes the expiration time if it finds any.
- When a subtask accesses a partition, the expiration time for the supervisorTask is initialized, or updated if it's already there.
- MM periodically checks the expiration times of supervisorTasks. If it finds an expired supervisorTask, it asks the overlord whether the task is still running. If not, MM removes all partitions for that supervisorTask.
- The overlord will also send a cleanup request to MM when the supervisorTask is finished. This cleans up the expiration time.

I think it's not very complex, and it will reduce the number of calls to the overlord API, so it would be good.

> it would be extra nice to document the failure cases and expected behavior e.g.

Ah, I didn't document them since the behavior would be the same as that of the existing parallel index task. Here is how it currently handles failures.

> supervisor task process (or MM running it) crashed while phase1/2 tasks were running

If the supervisorTask process is killed normally, the `stopGracefully` method is called, which kills all running subtasks. If it's killed abnormally, the parallel index task doesn't handle this case for now.

> one or more of phase1/2 tasks crashed

The supervisorTask monitors subtask statuses and counts how many subtasks have failed to process the same input. If it sees more failures than the configured `maxRetry`, it concludes that the input can't be processed and exits with an error. Otherwise, it spawns a new task which processes the same input.

I'll add these to the proposal.

> are any/all of these tasks restorable i.e. return true for canRestore() ?

Good point. None of the tasks is restorable now, but I think restorability might be useful for supporting rolling updates in the future. Parallel index tasks are expected to run for a long time, so it would be nice if they could be stopped and restored during a rolling update.
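
To make the expiration bookkeeping described above for MM a bit more concrete, here is a rough Java sketch. It is not code from Druid: the class `PartitionExpirationTracker`, its method names, and the `isRunningOnOverlord` predicate (a stand-in for the call to the overlord) are all hypothetical. Extending the expiration time when the overlord reports the task as still running is also an assumption, since that case isn't spelled out above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of in-memory expiration tracking on the MM side.
 * Keys are supervisorTask IDs; values are expiration timestamps in millis.
 */
final class PartitionExpirationTracker
{
  private final long timeoutMillis;
  // Stand-in for asking the overlord whether a supervisorTask is still running.
  private final Predicate<String> isRunningOnOverlord;
  private final Map<String, Long> expirationTimes = new ConcurrentHashMap<>();

  PartitionExpirationTracker(long timeoutMillis, Predicate<String> isRunningOnOverlord)
  {
    this.timeoutMillis = timeoutMillis;
    this.isRunningOnOverlord = isRunningOnOverlord;
  }

  /** Called by the periodic scan when partitions of a supervisorTask are found on disk. */
  void registerIfAbsent(String supervisorTaskId, long nowMillis)
  {
    expirationTimes.putIfAbsent(supervisorTaskId, nowMillis + timeoutMillis);
  }

  /** Called when a subtask accesses a partition: initialize or push back the expiration time. */
  void touch(String supervisorTaskId, long nowMillis)
  {
    expirationTimes.put(supervisorTaskId, nowMillis + timeoutMillis);
  }

  /** Called when the overlord sends a cleanup request for a finished supervisorTask. */
  void cleanup(String supervisorTaskId)
  {
    expirationTimes.remove(supervisorTaskId);
  }

  /**
   * Periodic check. Only expired entries trigger an overlord lookup.
   * Returns the supervisorTask IDs whose partitions should be deleted.
   */
  List<String> checkExpired(long nowMillis)
  {
    List<String> toRemove = new ArrayList<>();
    for (Map.Entry<String, Long> entry : expirationTimes.entrySet()) {
      if (entry.getValue() > nowMillis) {
        continue; // not expired yet, no overlord call needed
      }
      String id = entry.getKey();
      if (isRunningOnOverlord.test(id)) {
        // Assumption: a still-running task simply gets a fresh timeout.
        expirationTimes.put(id, nowMillis + timeoutMillis);
      } else {
        expirationTimes.remove(id);
        toRemove.add(id);
      }
    }
    return toRemove;
  }
}
```

In this sketch, a restarted MM starts with an empty map, and the periodic partition scan repopulates it through `registerIfAbsent`. The overlord is contacted only for entries that have already expired, which is where the reduction in overlord API calls would come from.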
