KEN-LJQ opened a new issue, #13409:
URL: https://github.com/apache/dolphinscheduler/issues/13409

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When a task finish very soon, like within 1 seconds, may stuck running 
status forever, although it had finish successfully.
   
   while digging into code and log, we found that, when worker receive task, it 
will send two `ack` response first, and send `finish` response while task 
finish.
   But when task finish very fast, those 3 response may receive by master at 
the same time.
   
![企业微信截图_16729927376406](https://user-images.githubusercontent.com/16650282/212637861-1d9f1a4a-81c9-4247-a962-54f88d9319af.png)
   
   So, master may handle `finish` repsonse before `ack` response, which lead to 
wrong task status, and we can see it from db binlog
   
   1. insert a record while submit
   ```
   #230105  6:31:41 server id 13  end_log_pos 476308895 CRC32 0xcd211e98   
Query   thread_id=441009        exec_time=0     error_code=0
   SET TIMESTAMP=1672871501/*!*/;
   INSERT INTO t_ds_task_instance  ( dry_run,
   flag,
   environment_code,
   pid,
   task_params,
   task_type,
   task_instance_priority,
   task_code,
   worker_group,
   state,
   process_instance_id,
   executor_id,
   alert_flag,
   first_submit_time,
   max_retry_times,
   retry_times,
   submit_time,
   name,
   task_definition_version,
   delay_time,
   retry_interval )  VALUES  ( 0,
   1,
   0,
   0,
   '{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists 
`tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` 
string, \\n`consult_durat
   ion_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, 
\\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int, 
\\n`consul
   t_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS 
parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showType":"","local
   
Params":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout":null}',
   'SQL',
   2,
   7484249458146,
   'default',
   0,
   19434,
   3,
   0,
   '2023-01-05 06:31:41.514',
   3,
   0,
   '2023-01-05 06:31:41.514',
   'test',
   2,
   0,
   1 )
   /*!*/;
   ```
   
   2. dispatch to worker
   ```
   #230105  6:31:41 server id 13  end_log_pos 476309341 CRC32 0x9c9c2283   
Query   thread_id=441008        exec_time=0     error_code=0
   SET TIMESTAMP=1672871501/*!*/;
   update t_ds_task_instance
           set host        = '172.16.1.151:1235',
               submit_time = '2023-01-05 06:31:41.527'
           where id = 69502
   /*!*/;
   ```
   4. update state to `running`
   ```
   # at 476320841
   #230105  6:31:42 server id 13  end_log_pos 476322293 CRC32 0x439e70b4   
Query   thread_id=441009        exec_time=0     error_code=0
   SET TIMESTAMP=1672871502/*!*/;
   UPDATE t_ds_task_instance  SET dry_run=0,
   flag=1,
   environment_code=0,
   pid=0,
   task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists 
`tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` 
string, \\n`c
   onsult_duration_second` int, \\n`consult_start_time` int, 
\\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, 
\\n`consult_evaluation` int
   , \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED 
AS 
parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
   
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
   :null}',
   
   task_type='SQL',
   task_instance_priority=2,
   task_code=7484249458146,
   log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
   host='172.16.1.151:1235',
   start_time='2023-01-05 06:31:41',
   
   worker_group='default',
   state=1,
   process_instance_id=19434,
   executor_id=3,
   alert_flag=0,
   
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
   first_submit_time='2023-01-05 06:31:42',
   max_retry_times=3,
   retry_times=0,
   
   submit_time='2023-01-05 06:31:42',
   name='test',
   task_definition_version=2,
   delay_time=0,
   retry_interval=1  WHERE id=69502
   /*!*/;
   ```
   5. update state to `success`
   ```
   #230105  6:31:42 server id 13  end_log_pos 476325363 CRC32 0x045eddfc   
Query   thread_id=441009        exec_time=0     error_code=0
   SET TIMESTAMP=1672871502/*!*/;
   UPDATE t_ds_task_instance  SET dry_run=0,
   flag=1,
   environment_code=0,
   pid=0,
   task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists 
`tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` 
string, \\n`c
   onsult_duration_second` int, \\n`consult_start_time` int, 
\\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, 
\\n`consult_evaluation` int
   , \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED 
AS 
parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
   
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
   :null}',
   
   task_type='SQL',
   task_instance_priority=2,
   task_code=7484249458146,
   log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
   host='172.16.1.151:1235',
   start_time='2023-01-05 06:31:41',
   
   worker_group='default',
   state=7,
   process_instance_id=19434,
   executor_id=3,
   alert_flag=0,
   
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
   first_submit_time='2023-01-05 06:31:42',
   max_retry_times=3,
   retry_times=0,
   var_pool='[]',
   submit_time='2023-01-05 06:31:42',
   name='test',
   task_definition_version=2,
   delay_time=0,
   retry_interval=1,
   end_time='2023-01-05 06:31:42'  WHERE id=69502
   /*!*/;
   ```
   6. update state back to `running`
   ```
   #230105  6:31:42 server id 13  end_log_pos 476326921 CRC32 0x0207b2c5   
Query   thread_id=441008        exec_time=0     error_code=0
   SET TIMESTAMP=1672871502/*!*/;
   UPDATE t_ds_task_instance  SET dry_run=0,
   flag=1,
   environment_code=0,
   pid=0,
   task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists 
`tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` 
string, \\n`c
   onsult_duration_second` int, \\n`consult_start_time` int, 
\\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, 
\\n`consult_evaluation` int
   , \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED 
AS 
parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
   
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
   :null}',
   task_type='SQL',
   task_instance_priority=2,
   task_code=7484249458146,
   log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
   host='172.16.1.151:1235',
   start_time='2023-01-05 06:31:41',
   worker_group='default',
   state=1,
   process_instance_id=19434,
   executor_id=3,
   alert_flag=0,
   first_submit_time='2023-01-05 06:31:42',
   max_retry_times=3,
   retry_times=0,
   submit_time='2023-01-05 06:31:42',
   name='test',
   task_definition_version=2,
   delay_time=0,
   retry_interval=1  WHERE id=69502
   ```
   
   ### What you expected to happen
   
   The final state of task should be `success`, which state=7
   
   ### How to reproduce
   
   Kind of hard to reproduce, we can try create a sql task with short run 
statment like `create table` , and create a branch merge like belove
   
![企业微信截图_16729861779567](https://user-images.githubusercontent.com/16650282/212640964-9e801969-cb13-45ac-b158-a2c86f8ae4d2.png)
   
   then return this process multiple times, untill task node stuck at `running` 
state forever
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   2.0.x
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to