[ 
https://issues.apache.org/jira/browse/IMPALA-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246881#comment-17246881
 ] 

Wenzhe Zhou commented on IMPALA-10336:
--------------------------------------

This issue is related to my code change for 
[IMPALA-6788|https://gerrit.cloudera.org/#/c/16192/], which made the 
coordinator to cancel RPC in the flight, and happened in rare race conditions.
Tried to find the root cause from log messages and here are what are found:
 
1) Coordinator started execution of query fragment instances on three 
executors.  
2) The first executor started to execute instances, got "Type mismatch" error 
as expected, and sent a status report to coordinator with the error.
3) Coordinator called Coordinator::UpdateBackendExecStatus() to process the 
backend status report from the first executor. Since the report was received 
before all exec RPCs were complete, it sent a notification to 
exec_rpcs_status_barrier_, and waited until all exec RPCs were complete.
4) Coordinator called CancelBackends() in Coordinator::StartBackendExec() to 
cancel all the backends, including the third RPC in flight. 
5) The RPC call was received by 3-rd executor, but somehow the RPC response was 
not received by the coordinator so the RPC callback function was not called.
6)  The third executor started to execute query instances and sent a status 
report.
7) RPC callback function was called for the third RPC due to cancellation, the 
callback function set the status_ of the third backend as "RPC aborted due to 
cancellation".
8) Coordinator processed the status report from the third executor. Since the 
executer's status_ was already set as error in the previous step, it set 
overall execution status as "RPC aborted due to cancellation". 
 
 
When we call Coordinator::UpdateExecState() in 
Coordinator::UpdateBackendExecStatus(),  we should use the backend's status 
reported from backend, and avoid using the backend's status set by coordinator. 
"RPC aborted due to cancellation" error was set for the backend by coordinator 
in callback function BackendState::ExecCompleteCb(). 
In Coordinator::UpdateExecState(), we call backend_state->GetStatus() after 
calling BackendState::ApplyExecStatusReport(). The coordinator could update 
backend's status in between.
So in the current code, it's possible to use the backend's status set by 
coordinator to call Coordinator::UpdateExecState(). 
 
To avoid this race condition, we should add more parameters in 
BackendState::ApplyExecStatusReport() to return backend status, instead of 
separate function call.
 

> test_type_conversions_hive3 fails because incorrect error is returned to 
> client
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-10336
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10336
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>            Reporter: Tim Armstrong
>            Assignee: Wenzhe Zhou
>            Priority: Blocker
>              Labels: broken-build
>         Attachments: coordinator_logs.txt, 
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-1742.vpc.cloudera.com.jenkins.log.INFO.20201113-170705.10502.gz,
>  
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-1742.vpc.cloudera.com.jenkins.log.INFO.20201113-170705.10506.gz,
>  
> impalad.impala-ec2-centos74-m5-4xlarge-ondemand-1742.vpc.cloudera.com.jenkins.log.INFO.20201113-170705.10512.gz
>
>
> {noformat}
> Regression
> query_test.test_scanners.TestOrc.test_type_conversions_hive3[protocol: 
> beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: 
> orc/def/block] (from pytest)
> Failing for the past 1 build (Since Failed#102 )
> Took 1 min 43 sec.
> add description
> Error Message
> query_test/test_scanners.py:1538: in test_type_conversions_hive3     
> self.run_test_case('DataErrorsTest/orc-type-checks', vector, unique_database) 
> common/impala_test_suite.py:668: in run_test_case     
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db) 
> common/impala_test_suite.py:485: in __verify_exceptions     (expected_str, 
> actual_str) E   AssertionError: Unexpected exception string. Expected: Type 
> mismatch: table column TINYINT is map to column smallint in ORC file E   Not 
> found in actual: ImpalaBeeswaxException: Query aborted:ExecQueryFInstances 
> rpc query_id=8f461cf08845e57c:32ec8ff300000000 failed: Exec() rpc failed: 
> Aborted: ExecQueryFInstances RPC to 127.0.0.1:27002 is cancelled in state SENT
> Stacktrace
> query_test/test_scanners.py:1538: in test_type_conversions_hive3
>     self.run_test_case('DataErrorsTest/orc-type-checks', vector, 
> unique_database)
> common/impala_test_suite.py:668: in run_test_case
>     self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:485: in __verify_exceptions
>     (expected_str, actual_str)
> E   AssertionError: Unexpected exception string. Expected: Type mismatch: 
> table column TINYINT is map to column smallint in ORC file
> E   Not found in actual: ImpalaBeeswaxException: Query 
> aborted:ExecQueryFInstances rpc query_id=8f461cf08845e57c:32ec8ff300000000 
> failed: Exec() rpc failed: Aborted: ExecQueryFInstances RPC to 
> 127.0.0.1:27002 is cancelled in state SENT
> ....
> -- 2020-11-13 23:05:01,187 INFO     MainThread: Started query 
> 62432dd724112633:bd27d62200000000
> -- executing against localhost:21000
> select c4 from illtypes;
> -- 2020-11-13 23:05:01,243 INFO     MainThread: Started query 
> 8f461cf08845e57c:32ec8ff300000000
> {noformat}
> The problem seems to be that the error that occurs during Prepare() phase is 
> not propagated correctly, and a different internal error wins a race to 
> become the status of the query.
> Attached executor logs and the coordinator logs for the query id. The full 
> coordinator log was too big to attach so I put it here: 
> https://drive.google.com/file/d/1hZVi1AmRJ4Lz7yIeJqVpiXeA_MZTP1it/view?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to