wangbo opened a new issue #4202:
URL: https://github.com/apache/incubator-doris/issues/4202


   **Describe the bug**
   An alter table (schema change) job failed and was cancelled:
   ```
           JobId: 14779188
       TableName: xxxxx
      CreateTime: 2020-07-28 15:39:14
      FinishTime: 2020-07-28 15:45:09
       IndexName: xxxxx
         IndexId: 14779189
   OriginIndexId: 14465552
   SchemaVersion: 3:803621913
   TransactionId: 1006404
           State: CANCELLED
             Msg: errCode = 2, detailMessage = schema change task failed after 
try three times: task type: ALTER, status_code: RUNTIME_ERROR, backendId: 
Backend [id=10004, host=xxxxx, heartbeatPort=9050, alive=true], signature: 
14782162
        Progress: N/A
         Timeout: 86400
   ```
   The alter job was submitted at '15:39:14'
   
   **Problem Location**
   1 log in BE ```10004```
   ```
   I0728 15:43:21.105201 22930 task_worker_pool.cpp:451] get alter table task, 
signature: 14782162
   I0728 15:43:21.105213 22930 schema_change.cpp:1203] begin to do request 
alter tablet: base_tablet_id=14468525, base_schema_hash=825165665, 
new_tablet_id=14782162, new_schema_hash=803621913, alter_version=
   1, alter_version_hash=0
   W0728 15:43:21.105237 22930 tablet_manager.cpp:1109] tablet does not exists. 
tablet_id=14468525
   W0728 15:43:21.113729 22930 engine_alter_tablet_task.cpp:42] failed to do 
alter task. res=-216 base_tablet_id=14468525, base_schema_hash=825165665, 
new_tablet_id=14782162, new_schema_hash=803621913
   ```
   1.1 Doris wants to create a new tablet (id=14782162) based on the base tablet 
(id=```14468525```)
   1.2 The base tablet does not exist on this BE, so the alter task failed
   1.3 The failure time is '15:43:21'
   
   2 log in FE
   2.1 The clone task begins at '15:39:12'; the tablet id is ```14468525```
   ```
   2020-07-28 15:39:12,285 INFO 40 
[TabletScheduler.schedulePendingTablets():412] add clone task to agent task 
queue: tablet id: 14468525, schema hash: 825165665, storageMedium: HDD, visible 
version(hash): 1-0, src backend: xxx, src path hash: 796686472853631190, dest 
backend: 3632012, dest path hash: 8397208614267550439
   2020-07-28 15:39:12,314 INFO 826360 [TabletSchedCtx.finishCloneTask():871] 
clone finished: tablet id: 14468525, status: HEALTHY, state: FINISHED, type: 
BALANCE. from backend: 3632016, src path hash: 796686472853631190. to backend: 
3632012, dest path hash: 8397208614267550439
   ```
   
   2.2 At '15:39:38', the replica of ```14468525``` on ```10004``` is deleted 
from FE's catalog
   ```
   2020-07-28 15:39:38,910 INFO 40 
[TabletScheduler.deleteReplicaInternal():911] delete replica. tablet id: 
14468525, backend id: 10004. reason: DECOMMISSION state, force: false
   ```
   
   2.3 The alter job starts running at '15:39:50'
   ```
   2020-07-28 15:39:36,880 WARN 28 [AlterJobV2.checkTableStable():196] wait 
table 13908258 to be stable before doing SCHEMA_CHANGE job
   2020-07-28 15:39:50,260 INFO 28 [SchemaChangeJobV2.runPendingJob():309] 
transfer schema change job 14779188 state to WAITING_TXN, watershed txn id: 
1006404
   ```
   This shows that even after a replica has been deleted from FE's catalog, the 
AlterJob still sends a request to a stale BE that no longer contains the 
wanted tablet.
   The main reason is that AlterJob's ```partitionIndexMap``` is generated when 
the job is created, and ```partitionIndexMap``` contains BE and replica info.
   But the BE and replica info may change while the job stays in pending status.
   
   **Solution**
   I think the best solution is to generate shadowReplica when 
SchemaChangeJobV2.runPendingJob executes, rather than when the job is created
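
The difference between the two approaches can be sketched with a toy model. This is only an illustration under stated assumptions, not Doris code: `ToyCatalog`, `ToyAlterJob` and all of their methods are hypothetical names, and the tablet and backend ids are borrowed from the logs above purely as sample values. The job either keeps the backend list it captured at creation time (the behaviour described above) or asks the catalog again when it leaves the pending state (the proposed behaviour).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for FE's catalog: tablet id -> ids of backends holding a replica.
class ToyCatalog {
    private final Map<Long, Set<Long>> replicas = new HashMap<>();

    void addReplica(long tabletId, long backendId) {
        replicas.computeIfAbsent(tabletId, k -> new HashSet<>()).add(backendId);
    }

    void deleteReplica(long tabletId, long backendId) {
        Set<Long> backends = replicas.get(tabletId);
        if (backends != null) {
            backends.remove(backendId);
        }
    }

    // Returns a sorted copy, so callers hold a snapshot rather than a live view.
    List<Long> backendsOf(long tabletId) {
        List<Long> copy = new ArrayList<>(replicas.getOrDefault(tabletId, Collections.emptySet()));
        Collections.sort(copy);
        return copy;
    }
}

// Hypothetical alter job that can pick its target backends in two ways.
class ToyAlterJob {
    private final ToyCatalog catalog;
    private final long baseTabletId;
    private final List<Long> backendsAtCreation;

    ToyAlterJob(ToyCatalog catalog, long baseTabletId) {
        this.catalog = catalog;
        this.baseTabletId = baseTabletId;
        // Snapshot taken when the job is created; it is never refreshed.
        this.backendsAtCreation = catalog.backendsOf(baseTabletId);
    }

    // Reported behaviour: targets come from the creation-time snapshot.
    List<Long> targetsFromSnapshot() {
        return backendsAtCreation;
    }

    // Proposed behaviour: targets are resolved when the pending job starts running.
    List<Long> targetsAtRunTime() {
        return catalog.backendsOf(baseTabletId);
    }
}

class StaleReplicaDemo {
    public static void main(String[] args) {
        ToyCatalog catalog = new ToyCatalog();
        catalog.addReplica(14468525L, 10004L);
        catalog.addReplica(14468525L, 3632012L);

        ToyAlterJob job = new ToyAlterJob(catalog, 14468525L); // job created, stays pending
        catalog.deleteReplica(14468525L, 10004L);              // replica removed meanwhile

        System.out.println(job.targetsFromSnapshot()); // [10004, 3632012] (10004 is stale)
        System.out.println(job.targetsAtRunTime());    // [3632012]
    }
}
```

In this model the snapshot still lists backend 10004 after its replica has been deleted, which mirrors the failed request in the BE log, while the run-time lookup only returns backends that the catalog currently knows about.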




