[ https://issues.apache.org/jira/browse/TEZ-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144357#comment-14144357 ]
Bikas Saha commented on TEZ-992:
--------------------------------

In this case, the flow looks like RUNNING -> WAIT_FOR_COMMIT_START_SAVED -> DO_COMMIT -> WAIT_FOR_VERTEX_FINISHED_SAVED -> FINISH. So it does look like we need 3 states.

However, there was an open item to do the commit on a separate thread because the commit itself can take a long time. Given the need for a separate thread, maybe the easier thing to do would be RUNNING -> COMMITTING -> FINISHED, where the RUNNING -> COMMITTING transition enqueues the commit operation on a thread or thread pool. After the commit completes on that thread, it sends an event to its vertex, which moves COMMITTING -> FINISHED.

Given that the operation happens on a separate thread, this thread could do the following. If a committer is present, then save commit_start (blocking), then commit. In both cases (committer present or not present), it will save finished (blocking) and then send an event to its vertex that will change the vertex from COMMITTING to FINISHED.

Are there any other blocking operations?

> Recovery data should not be written on AsyncDispatcher thread
> -------------------------------------------------------------
>
>                 Key: TEZ-992
>                 URL: https://issues.apache.org/jira/browse/TEZ-992
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Jeff Zhang
>
> This may block the DAG operations in case the recovery data needs to be
> synchronously stored. The operations requiring this blocking operation should
> change their state machines to wait for the store operation before moving
> ahead. They will move ahead after they receive notification from the
> RecoveryService that their operation has completed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
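
The RUNNING -> COMMITTING -> FINISHED flow proposed in the comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not Tez code: `CommitFlowSketch`, `Vertex`, `onAllTasksDone`, and the log strings are hypothetical names, the blocking recovery writes are simulated by appending to a list, and the completion "event" is a direct synchronized call rather than an event routed back through the dispatcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CommitFlowSketch {
    enum VertexState { RUNNING, COMMITTING, FINISHED }

    // Hypothetical stand-in for a vertex; this is not the Tez vertex API.
    static class Vertex {
        private final boolean hasCommitter;
        private final ExecutorService commitPool;
        private final List<String> log = new ArrayList<>();
        private VertexState state = VertexState.RUNNING;

        Vertex(boolean hasCommitter, ExecutorService commitPool) {
            this.hasCommitter = hasCommitter;
            this.commitPool = commitPool;
        }

        synchronized VertexState getState() { return state; }

        synchronized List<String> getLog() { return new ArrayList<>(log); }

        // RUNNING -> COMMITTING: only enqueue the work, so the calling
        // (dispatcher) thread never blocks on recovery writes or the commit.
        synchronized void onAllTasksDone() {
            if (state != VertexState.RUNNING) {
                throw new IllegalStateException("unexpected state " + state);
            }
            state = VertexState.COMMITTING;
            commitPool.execute(this::runCommit);
        }

        // Runs on the commit thread; each step stands in for a blocking call.
        private void runCommit() {
            if (hasCommitter) {
                record("save commit_start"); // simulated blocking recovery write
                record("commit");            // simulated long-running commit
            }
            record("save vertex_finished");  // simulated blocking recovery write
            onCommitCompleted();             // stands in for the event to the vertex
        }

        private synchronized void record(String step) { log.add(step); }

        // COMMITTING -> FINISHED
        private synchronized void onCommitCompleted() {
            state = VertexState.FINISHED;
        }
    }

    // Drives one vertex through the flow and waits for the commit thread.
    static Vertex run(boolean hasCommitter) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Vertex vertex = new Vertex(hasCommitter, pool);
        vertex.onAllTasksDone();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return vertex;
    }

    public static void main(String[] args) {
        Vertex vertex = run(true);
        System.out.println(vertex.getState() + " " + vertex.getLog());
    }
}
```

With this shape the vertex needs only one intermediate state, because both blocking saves and the commit happen sequentially on the commit thread instead of being modeled as separate wait states.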