weizuo93 opened a new issue #5562:
URL: https://github.com/apache/incubator-doris/issues/5562


   Description:
   The Doris cluster has 3 BEs and 1 FE, and the replication number is 3. If one 
BE is stopped, a stream load returns the following result:
   ```
   {
       "TxnId": 34149,
       "Label": "1616552269",
       "Status": "Fail",
       "Message": "failed to call frontend service",
       "NumberTotalRows": 200,
       "NumberLoadedRows": 200,
       "NumberFilteredRows": 0,
       "NumberUnselectedRows": 0,
       "LoadBytes": 8935,
       "LoadTimeMs": 23154,
       "BeginTxnTimeMs": 0,
       "StreamLoadPutTimeMs": 3,
       "ReadDataTimeMs": 0,
       "WriteDataTimeMs": 149,
       "CommitAndPublishTimeMs": 0
   }
   
   ```
   
   The FE log is as follows:
   ```
   2021-03-24 10:17:49,818 INFO (thrift-server-pool-5|133) 
[DatabaseTransactionMgr.beginTransaction():295] begin transaction: txn id 34149 
with label 1616552269 from coordinator BE: 10.38.167.158, listner id: -1
   2021-03-24 10:17:49,984 INFO (thrift-server-pool-5|133) 
[DatabaseTransactionMgr.commitTransaction():559] transaction:[TransactionState. 
transaction id: 34149, label: 1616552269, db id: 11001, table id list: 11012, 
callback id: -1, coordinator: BE: 10.38.167.158, transaction status: COMMITTED, 
error replicas num: 100, replica ids: 11265,11269,11273,11017,11277, prepare 
time: 1616552269818, commit time: 1616552269980, finish time: -1, reason: ] 
successfully committed
   2021-03-24 10:17:49,986 INFO (PUBLISH_VERSION|19) 
[PublishVersionDaemon.publishVersion():131] send publish tasks for transaction: 
34149
   2021-03-24 10:18:13,071 WARN (thrift-server-pool-5|133) 
[FrontendServiceImpl.loadTxnRollback():848] failed to rollback txn 34149: 
errCode = 2, detailMessage = transaction's state is already COMMITTED, could 
not abort
   2021-03-24 10:18:19,995 INFO (PUBLISH_VERSION|19) 
[DatabaseTransactionMgr.finishTransaction():826] finish transaction 
TransactionState. transaction id: 34149, label: 1616552269, db id: 11001, table 
id list: 11012, callback id: -1, coordinator: BE: 10.38.167.158, transaction 
status: VISIBLE, error replicas num: 100, replica ids: 
11265,11269,11273,11017,11277, prepare time: 1616552269818, commit time: 
1616552269980, finish time: 1616552299994, reason:  successfully
   2021-03-24 11:18:20,995 INFO (txnCleaner|56) 
[DatabaseTransactionMgr.removeExpiredTxns():1087] transaction [34149] is 
expired, remove it from transaction manager
   ```
   
   The coordinator BE log is as follows:
   ```
   I0324 10:17:49.820899 155649 stream_load_executor.cpp:50] begin to execute 
job. label=1616552269, txn_id=34149, query_id=a941b0d30a337020-c6406f0699945b82
   I0324 10:17:49.820930 155649 plan_fragment_executor.cpp:76] Prepare(): 
query_id=a941b0d30a337020-c6406f0699945b82 
fragment_instance_id=a941b0d30a337020-c6406f0699945b83 backend_num=0
   I0324 10:17:49.820993 155649 plan_fragment_executor.cpp:138] Using query 
memory limit: 2.00 GB
   W0324 10:18:00.050282 155649 thrift_rpc_helper.cpp:66] retrying call 
frontend service after 1000 ms, address=TNetworkAddress(hostname=10.38.163.97, 
port=19020), reason=THRIFT_EAGAIN (timed out)
   W0324 10:18:11.050664 155649 thrift_rpc_helper.cpp:79] call frontend service 
failed, address=TNetworkAddress(hostname=10.38.163.97, port=19020), 
reason=THRIFT_EAGAIN (timed out)
   W0324 10:18:13.051308 155649 stream_load.cpp:116] handle streaming load 
failed, id=a941b0d30a337020-c6406f0699945b82, errmsg=failed to call frontend 
service
   ```
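
The timestamps in the two logs can be lined up on one timeline. The following is a minimal sketch in Python that uses only values copied from the logs above; the event names are labels chosen here for illustration, not Doris identifiers.

```python
from datetime import datetime

# Timestamps copied from the FE and BE logs above (all on 2021-03-24).
# The event names are illustrative labels, not Doris identifiers.
EVENTS = [
    ("FE: begin transaction", "10:17:49.818"),
    ("FE: transaction COMMITTED", "10:17:49.984"),
    ("BE: RPC timed out, retrying", "10:18:00.050282"),
    ("BE: call frontend service failed", "10:18:11.050664"),
    ("FE: rollback rejected", "10:18:13.071"),
    ("FE: transaction VISIBLE", "10:18:19.995"),
]


def offsets_from_start(events):
    """Return (name, seconds since the first event) for each event."""
    fmt = "%H:%M:%S.%f"
    start = datetime.strptime(events[0][1], fmt)
    return [
        (name, (datetime.strptime(ts, fmt) - start).total_seconds())
        for name, ts in events
    ]


if __name__ == "__main__":
    for name, offset in offsets_from_start(EVENTS):
        print(f"{offset:8.3f}s  {name}")
```

With these values, the BE reports the RPC failure roughly 21 seconds after the transaction began, while the FE marks the transaction VISIBLE roughly 30 seconds after it began.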
   
   According to the logs, the BE called `commit txn rpc`, and the FE received 
the request and committed the transaction, but the BE never received the 
response to `commit txn rpc`. From the BE's point of view, `commit txn rpc` 
failed, so the BE then called `rollback txn rpc`. The FE cannot roll back a 
transaction that is already COMMITTED, so the rollback was rejected, and the 
message `failed to call frontend service` was returned to the client.
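
The sequence described above can be modeled with a small sketch. It is a toy simulation in Python, not Doris code: the `Frontend` class, the `coordinator_load` function, the `response_lost` flag, and the initial state name `PREPARE` are all assumptions made for illustration. Only the COMMITTED state and the two error messages come from the logs.

```python
class RpcTimeout(Exception):
    """The caller did not receive a response in time."""


class TxnError(Exception):
    """The frontend rejected the requested state change."""


class Frontend:
    """Toy FE that tracks one status string per transaction id."""

    def __init__(self):
        self.status = {}

    def begin(self, txn_id):
        self.status[txn_id] = "PREPARE"  # assumed initial state name

    def commit(self, txn_id):
        self.status[txn_id] = "COMMITTED"

    def rollback(self, txn_id):
        if self.status[txn_id] == "COMMITTED":
            raise TxnError(
                "transaction's state is already COMMITTED, could not abort"
            )
        self.status[txn_id] = "ABORTED"


def coordinator_load(fe, txn_id, response_lost):
    """Toy coordinator BE: commit, and roll back if the commit call fails."""
    fe.begin(txn_id)
    try:
        fe.commit(txn_id)  # the request reaches the FE and takes effect
        if response_lost:
            # The response never arrives, so the caller sees a timeout.
            raise RpcTimeout("THRIFT_EAGAIN (timed out)")
        return "Success"
    except RpcTimeout:
        try:
            fe.rollback(txn_id)
        except TxnError:
            pass  # the FE refuses: the transaction is already committed
        return "Fail"


if __name__ == "__main__":
    fe = Frontend()
    print(coordinator_load(fe, 34149, response_lost=True), fe.status[34149])
```

In this model the client is told `Fail` while the transaction stays COMMITTED on the FE, which matches the mismatch between the stream load result and the FE log.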
   
   Can anyone tell me why the coordinator BE does not receive the response to 
`commit txn rpc` when one BE is stopped?




