[ 
https://issues.apache.org/jira/browse/TEZ-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratnesh Mishra updated TEZ-4595:
--------------------------------
    Description: 
Running into a race condition when Hive tries to submit a query to an 
application with a recovery file present.
Submitting a query to an application that already has a previous DAG present 
via the recovery file leads to a race condition in the AM.

*Steps to repro* 
1. Run a Hive query and wait for it to finish
2. Restart the Tez AM (saw the same behaviour with the AM_REBOOT event)
3. Fire another query (we need to ensure that the second query runs in the same 
Application container as before)

*Observation* 
The AM runs into a race condition because the incoming query's DAG and the 
recovered DAG have the same DAG ID, causing the error below:
{code:java}
Status: Failed

Invalid event DAG_VERTEX_COMPLETED on Dag dag_1734697167504_0005_1 at 
currentState=NEW

Invalid event DAG_RECOVER on Dag dag_1734697167504_0005_1 at currentState=ERROR

FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.tez.TezTask. Invalid event DAG_VERTEX_COMPLETED 
on Dag dag_1734697167504_0005_1 at currentState=NEWInvalid event DAG_RECOVER on 
Dag dag_1734697167504_0005_1 at currentState=ERROR
{code}
It seems the main issue here is that the new DAG is accepted before the 
recovery flow has completed.
[From the code 
|https://github.com/apache/tez/blob/branch-0.10.2/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1295]
 we only check whether the status is _DAGAppMasterState.RUNNING_, and this flow 
can run once *serviceInit()* has run. However, recovery runs in 
*serviceStart()*, so the recovered DAG can be set while the incoming DAG is 
already executing. This can lead to an invalid state error or, in some cases, 
a wrong query result (the old query being run in place of the new query). 
Another potential error is that both DAGs get the same DAG ID, which in itself 
does not seem correct.
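
To make the suspected fix direction concrete, here is a minimal, self-contained 
sketch of gating DAG submission on recovery completion. It is a toy model and 
not Tez code: the class RecoveryGateSketch and the methods markRunning(), 
finishRecovery() and submitDag() are hypothetical names chosen for 
illustration, and the real DAGAppMaster logic is more involved. The sketch 
assumes that a submission should be rejected until recovery has finished, and 
that new DAG indices should continue after the highest recovered index so the 
two DAGs cannot share an ID.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model (not Tez code): accept a DAG only after recovery is done. */
final class RecoveryGateSketch {
    enum State { INITED, RUNNING }

    private volatile State state = State.INITED;
    private final AtomicBoolean recoveryDone = new AtomicBoolean(false);
    private final AtomicInteger lastDagIndex = new AtomicInteger(0);

    /** Models the AM reporting RUNNING, which may happen before recovery ends. */
    public void markRunning() {
        state = State.RUNNING;
    }

    /** Models the recovery flow finishing; keeps the highest DAG index seen. */
    public void finishRecovery(int recoveredDagIndex) {
        lastDagIndex.accumulateAndGet(recoveredDagIndex, Math::max);
        recoveryDone.set(true);
    }

    /** Returns a fresh DAG index, or throws while recovery is still pending. */
    public int submitDag() {
        // Checking the RUNNING state alone is the gap described above;
        // the extra recoveryDone check is the assumed guard.
        if (state != State.RUNNING || !recoveryDone.get()) {
            throw new IllegalStateException("AM not ready: recovery has not completed");
        }
        return lastDagIndex.incrementAndGet();
    }

    public static void main(String[] args) {
        RecoveryGateSketch am = new RecoveryGateSketch();
        am.markRunning(); // state is RUNNING, but recovery has not finished yet
        try {
            am.submitDag();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        am.finishRecovery(1); // the recovery file held DAG index 1
        System.out.println("accepted as DAG index " + am.submitDag());
    }
}
```

In this model the early submission is rejected instead of racing with the 
recovered DAG, and the later submission gets index 2 rather than reusing 
index 1. Whether the client should be rejected or made to wait until recovery 
completes is a design choice this sketch does not settle.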

  was:Running into race condition when Hive tries to submit a query to an 
Application  with recovery file  present


> Running into race condition when trying to submit a query to an Application  
> having recovery file
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4595
>                 URL: https://issues.apache.org/jira/browse/TEZ-4595
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.2
>            Reporter: Ratnesh Mishra
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
