Sahil Takiar created IMPALA-9295:
------------------------------------

             Summary: RPC failures don't always trigger a blacklist
                 Key: IMPALA-9295
                 URL: https://issues.apache.org/jira/browse/IMPALA-9295
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


There is a race condition in IMPALA-9137. It is possible for the aux_error_info 
and the failure status to arrive in separate exec status reports.

IMPALA-9137 added AuxErrorInfoPB to FragmentInstanceExecStatusPB (which contains 
per-fragment info for a ReportExecStatusRequestPB). The idea is that if a query 
fails, the Coordinator would use the AuxErrorInfoPB to potentially blacklist 
any nodes that caused the failure. The Coordinator only looks for 
AuxErrorInfoPB if the query has failed (e.g. if 
ReportExecStatusRequestPB::overall_status is set to an error).

The issue is that it is possible for the AuxErrorInfoPB to be set even though 
overall_status == OK. There is a race condition on the executor side where the 
setting of the aux_error_info and the overall_status is not synchronized. So if 
a fragment fails due to an RPC error, it is possible for report number "x" to 
include the aux_error_info with overall_status == OK, and report number "x + 1" 
to include no aux_error_info with overall_status == [some-RPC-failure-message].

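Report number "x + 1" won't include the aux_error_info since the fragment has 
finished and its last FragmentInstanceExecStatusPB was in report number "x". As 
a result, the Coordinator never sees the aux_error_info together with a failed 
status, so the node is not blacklisted.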
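
The sequence above can be sketched with a small simulation. This is only an illustration, not Impala code: the classes InstanceStatus and Report are simplified, hypothetical stand-ins for FragmentInstanceExecStatusPB and ReportExecStatusRequestPB, nodes_to_blacklist is a hypothetical helper modelling a coordinator that only reads aux error info when the report carries an error status, and the host name is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstanceStatus:
    """Hypothetical stand-in for FragmentInstanceExecStatusPB."""
    instance_id: str
    # Stand-in for AuxErrorInfoPB: the node the failed RPC was sent to.
    aux_error_node: Optional[str] = None


@dataclass
class Report:
    """Hypothetical stand-in for ReportExecStatusRequestPB."""
    overall_status: str  # "OK" or an error message
    instances: List[InstanceStatus] = field(default_factory=list)


def nodes_to_blacklist(report: Report) -> List[str]:
    """Model of a coordinator that ignores aux error info unless the
    report itself has a failed overall_status."""
    if report.overall_status == "OK":
        return []
    return [i.aux_error_node for i in report.instances
            if i.aux_error_node is not None]


# Report "x": aux error info is present, but overall_status is still OK.
report_x = Report("OK", [InstanceStatus("inst-1", aux_error_node="host-b")])
# Report "x + 1": error status, but the finished instance is not re-reported.
report_x_plus_1 = Report("RPC failure", [])

blacklisted: List[str] = []
for report in (report_x, report_x_plus_1):
    blacklisted.extend(nodes_to_blacklist(report))

print(blacklisted)  # [] : the failing node is never blacklisted
```

If both pieces of information had arrived in the same report, the same helper would have returned the failing node, which is why the ordering of the two reports matters.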



