jihoonson commented on issue #8061: Native parallel batch indexing with shuffle
URL: 
https://github.com/apache/incubator-druid/issues/8061#issuecomment-510699299
 
 
   > are you planning to track access time in MM memory ? what happens if MM 
process gets restarted for some reason and then it will lose track of that.
   
   Yes, this is what I'm thinking. My idea is pretty similar to yours, with a 
few additions. 
   
   - MM keeps an expiration time per supervisorTask in memory. 
   - The expiration time is initialized to `current time + configured 
timeout`. 
   - MM periodically checks whether any new partitions have been created for 
new supervisorTasks and initializes the expiration time for any it finds.
   - When a subtask accesses a partition, the expiration time for its 
supervisorTask is initialized, or updated if it already exists.
   - MM periodically checks the expiration times of supervisorTasks. If it 
finds an expired supervisorTask, it asks the overlord whether the task is 
still running. If not, MM removes all partitions for that supervisorTask.
   - The overlord also sends a cleanup request to MM when the 
supervisorTask finishes. This cleans up the expiration time as well.
   
   I think this is not very complex, and it reduces the number of calls to the 
overlord API, so it seems worthwhile.
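
   As a rough illustration, the expiration bookkeeping described above could 
be sketched as follows. All names here (`PartitionExpiryTracker`, 
`onPartitionsDiscovered`, `findTasksToCleanUp`, and so on) are hypothetical and 
not taken from the Druid codebase. Extending the expiration time when the 
overlord reports the task as still running is also an assumption; the list 
above does not say what happens in that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of the in-memory bookkeeping on the MM side.
class PartitionExpiryTracker
{
  private final long timeoutMillis;
  // supervisorTaskId -> expiration time in epoch millis; lost on MM restart
  private final Map<String, Long> expirationTimes = new ConcurrentHashMap<>();

  PartitionExpiryTracker(long timeoutMillis)
  {
    this.timeoutMillis = timeoutMillis;
  }

  // Periodic scan found partitions of a supervisorTask that is not tracked yet
  // (also how state is rebuilt after an MM restart).
  void onPartitionsDiscovered(String supervisorTaskId, long nowMillis)
  {
    expirationTimes.putIfAbsent(supervisorTaskId, nowMillis + timeoutMillis);
  }

  // A subtask accessed a partition: initialize or push the expiration forward.
  void onPartitionAccessed(String supervisorTaskId, long nowMillis)
  {
    expirationTimes.put(supervisorTaskId, nowMillis + timeoutMillis);
  }

  // Cleanup request sent by the overlord when the supervisorTask finishes.
  void onSupervisorTaskFinished(String supervisorTaskId)
  {
    expirationTimes.remove(supervisorTaskId);
  }

  // Periodic check. The overlord is asked only about expired tasks.
  // Returns the supervisorTasks whose partitions should be removed.
  List<String> findTasksToCleanUp(long nowMillis, Predicate<String> isRunningOnOverlord)
  {
    List<String> toCleanUp = new ArrayList<>();
    for (Map.Entry<String, Long> entry : expirationTimes.entrySet()) {
      if (entry.getValue() > nowMillis) {
        continue; // not expired yet, no overlord call needed
      }
      String taskId = entry.getKey();
      if (isRunningOnOverlord.test(taskId)) {
        // Assumption: a task that is still running gets a fresh timeout.
        expirationTimes.put(taskId, nowMillis + timeoutMillis);
      } else {
        expirationTimes.remove(taskId);
        toCleanUp.add(taskId);
      }
    }
    return toCleanUp;
  }
}
```

   The point of the sketch is that the overlord is queried only for 
supervisorTasks whose expiration time has passed, which is where the reduction 
in overlord API calls comes from.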
   
   > it would be extra nice to document the failure cases and expected behavior 
e.g.
   
   Ah, I didn't document them since it would be the same as the existing 
parallel index task behavior.
   Here are descriptions of how it currently handles failures.
   
   > supervisor task process (or MM running it) crashed while phase1/2 tasks 
were running
   
   If the supervisorTask process is killed normally, the `stopGracefully` 
method is called, which kills all running subtasks. If it's killed abnormally, 
the parallel index task doesn't handle this case for now.
   
   > one or more of phase1/2 tasks crashed
   
   The supervisorTask monitors subtask statuses and counts how many subtasks 
have failed to process the same input. If it sees more failures than the 
configured `maxRetry`, it concludes that the input can't be processed and exits 
with an error. Otherwise, it spawns a new task which processes the same input.
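
   The retry rule above can be sketched roughly as follows. The class and 
method names (`SubtaskRetryTracker`, `onSubtaskFailed`, `Action`) are 
hypothetical and only illustrate the counting logic; the actual parallel index 
task code may differ, including in how the `maxRetry` boundary is treated.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-input failure counting in the supervisorTask.
class SubtaskRetryTracker
{
  enum Action { RESPAWN, FAIL }

  private final int maxRetry;
  // input identifier -> number of subtasks that failed on this input
  private final Map<String, Integer> failuresPerInput = new HashMap<>();

  SubtaskRetryTracker(int maxRetry)
  {
    this.maxRetry = maxRetry;
  }

  // Called when a subtask processing the given input reports failure.
  Action onSubtaskFailed(String inputId)
  {
    int failures = failuresPerInput.merge(inputId, 1, Integer::sum);
    // More failures than maxRetry: give up on this input and fail the whole task.
    return failures > maxRetry ? Action.FAIL : Action.RESPAWN;
  }
}
```

   With `maxRetry = 2` in this sketch, the first two failures on the same 
input lead to a new subtask, and the third makes the supervisorTask exit with 
an error. Failures on different inputs are counted separately.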
   
   I'll add these to the proposal.
   
   > are any/all of these tasks restorable i.e. return true for canRestore() ?
   
   Good point. None of the tasks is restorable now, but I think it might be 
useful for supporting rolling updates in the future. Parallel index tasks are 
supposed to run for a long time, so it would be nice if they could be stopped 
and restored during a rolling update.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
