[jira] [Updated] (TEZ-4595) Running into race condition when trying to submit a query to an Application having recovery file

Ratnesh Mishra (Jira) Fri, 20 Dec 2024 05:34:40 -0800


     [ 
https://issues.apache.org/jira/browse/TEZ-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ratnesh Mishra updated TEZ-4595:
--------------------------------
    Description: 
During query execution in Hive, we see a race condition when Hive tries to 
submit a query to an application that has a recovery file present.

*Steps to repro* 
1. Run a Hive query and wait for it to finish.
2. Restart (kill) the Tez AM (the same behaviour was seen with an *AM_REBOOT* 
event).
3. Fire another query (this must be done quickly enough for Hive to submit the 
query to the same application ID).

*Observation* 
The AM runs into a race condition because the incoming query's DAG and the 
recovered DAG have the same DAG ID, which causes the error below:
{code:java}
Status: Failed

Invalid event DAG_VERTEX_COMPLETED on Dag dag_1734697167504_0005_1 at currentState=NEW

Invalid event DAG_RECOVER on Dag dag_1734697167504_0005_1 at currentState=ERROR

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Invalid event DAG_VERTEX_COMPLETED on Dag dag_1734697167504_0005_1 at currentState=NEWInvalid event DAG_RECOVER on Dag dag_1734697167504_0005_1 at currentState=ERROR
{code}
It seems the main issue here is that the new DAG is accepted before the 
recovery flow has completed.
[From the code 
|https://github.com/apache/tez/blob/branch-0.10.2/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1295]
 we only check whether the status is _DAGAppMasterState.RUNNING_. The RPC 
server is started in *serviceInit()*, after which the AM can accept a new DAG, 
but recovery runs in *serviceStart()*. As a result, the recovered DAG can be 
set as currentDAG while the incoming new DAG is already executing, which can 
lead to an invalid state error or, in some cases, a wrong query result (the 
old query being run in place of the new query). 
Another potential error is that both of these DAGs get the same DAG ID, which 
in itself doesn't seem correct.
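
To illustrate the kind of guard that would close this window, here is a minimal sketch. It is not Tez code and not a proposed patch: the class _SubmissionGate_ and its methods are hypothetical names made up for this example. It assumes that a new DAG is accepted only once the AM is running and recovery has finished, and that a new DAG index is always allocated above the index of any recovered DAG.

```java
/**
 * Hypothetical stand-in for the AM's DAG submission path. Not Tez code:
 * every name here is an assumption made for illustration only.
 */
final class SubmissionGate {
    private boolean running;          // stands in for DAGAppMasterState.RUNNING
    private boolean recoveryComplete; // set only after the recovery flow ends
    private int lastDagIndex;         // highest DAG index recovered or handed out

    /** Marks the AM as running (where the current check would already pass). */
    synchronized void markRunning() {
        running = true;
    }

    /**
     * Called when the recovery flow has finished.
     * Pass 0 when there was no DAG to recover.
     */
    synchronized void markRecoveryComplete(int recoveredDagIndex) {
        lastDagIndex = Math.max(lastDagIndex, recoveredDagIndex);
        recoveryComplete = true;
    }

    /** Returns the index for the new DAG, or rejects it if recovery is pending. */
    synchronized int submitDag() {
        if (!running || !recoveryComplete) {
            throw new IllegalStateException(
                "AM is not ready to accept a DAG: recovery has not completed");
        }
        return ++lastDagIndex;
    }

    public static void main(String[] args) {
        SubmissionGate gate = new SubmissionGate();
        gate.markRunning();
        try {
            gate.submitDag(); // arrives during the race window
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        gate.markRecoveryComplete(1); // a DAG with index 1 was recovered
        System.out.println("New DAG index: " + gate.submitDag());
    }
}
```

In this sketch a client that submits too early gets an explicit rejection and can retry, instead of its DAG colliding with the recovered one. Whether the real fix should reject, block, or queue the submission is left open here.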

  was:
During query execution in Hive, there is a small window where we see a race 
condition when Hive tries to submit a query to an application that has a 
recovery file present.

*Steps to repro* 
1. Run a Hive query and wait for it to finish.
2. Restart (kill) the Tez AM (the same behaviour was seen with an *AM_REBOOT* 
event).
3. Fire another query (this must be done quickly enough for Hive to submit the 
query to the same application ID).

*Observation* 
The AM runs into a race condition because the incoming query's DAG and the 
recovered DAG have the same DAG ID, which causes the error below:
{code:java}
Status: Failed

Invalid event DAG_VERTEX_COMPLETED on Dag dag_1734697167504_0005_1 at currentState=NEW

Invalid event DAG_RECOVER on Dag dag_1734697167504_0005_1 at currentState=ERROR

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Invalid event DAG_VERTEX_COMPLETED on Dag dag_1734697167504_0005_1 at currentState=NEWInvalid event DAG_RECOVER on Dag dag_1734697167504_0005_1 at currentState=ERROR
{code}
It seems the main issue here is that the new DAG is accepted before the 
recovery flow has completed.
[From the code 
|https://github.com/apache/tez/blob/branch-0.10.2/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1295]
 we only check whether the status is _DAGAppMasterState.RUNNING_. The RPC 
server is started in *serviceInit()*, after which the AM can accept a new DAG, 
but recovery runs in *serviceStart()*. As a result, the recovered DAG can be 
set as currentDAG while the incoming new DAG is already executing, which can 
lead to an invalid state error or, in some cases, a wrong query result (the 
old query being run in place of the new query). 
Another potential error is that both of these DAGs get the same DAG ID, which 
in itself doesn't seem correct.


> Running into race condition when trying to submit a query to an Application  
> having recovery file
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4595
>                 URL: https://issues.apache.org/jira/browse/TEZ-4595
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.2
>            Reporter: Ratnesh Mishra
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
