[jira] [Comment Edited] (FLINK-10819) JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure is unstable

Guowei Ma (JIRA) Thu, 25 Jul 2019 03:06:43 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889486#comment-16889486
 ]


Guowei Ma edited comment on FLINK-10819 at 7/25/19 10:05 AM:
-------------------------------------------------------------

It might be a bug, according to 
[https://api.travis-ci.org/v3/job/508500560/log.txt]

1. The test times out because two "READY_MARKER_FILE_PREFIX" files are 
missing.

2. The two tasks that are responsible for creating these files can't be 
deployed because no resources are available.

!image-2019-07-19-17-01-19-758.png!

3. However, it is still unclear why the RM has not allocated resources to the 
two tasks. Two task executors had already registered the first time, so there 
should be 4 free slots.

!image-2019-07-19-17-00-17-194.png!
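
To illustrate point 1, here is a minimal sketch of how a test might wait for 
marker files under a deadline. It is not the code of 
JobManagerHAProcessFailureRecoveryITCase; the class name, method name, 
parameters and the 10 ms poll interval are all assumptions made for 
illustration. If the tasks are never deployed, the files never appear and the 
wait ends with a timeout, which matches the symptom above.

```java
import java.io.File;

/** Hypothetical helper; names and poll interval are illustrative assumptions. */
final class MarkerFileWaiter {

    /**
     * Polls dir until at least `expected` entries whose names start with
     * `prefix` exist, or until timeoutMillis has elapsed.
     *
     * @return true if the entries appeared in time, false on timeout
     */
    static boolean waitForMarkerFiles(File dir, String prefix, int expected, long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // listFiles returns null if dir does not exist or cannot be read.
            File[] found = dir.listFiles((d, name) -> name.startsWith(prefix));
            if (found != null && found.length >= expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed: this is the "test times out" case
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```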

I have read several logs, and they all share the following three 
characteristics.

1. A TM always registers twice.
2. The task executor that registers twice uses the old RegistrationID in its 
heartbeat. I have already opened a Jira for this: [FLINK-13426].
3. When the job starts, it requests 4 slots. However, only one TM is 
available. During the execution of the Source, the SlotPool cancels the two 
slot requests that have not yet been fulfilled. In the next execution, the 
Scheduler issues two SlotRequests, but they do not return before the timeout.


was (Author: maguowei):
It might be a bug, according to 
https://api.travis-ci.org/v3/job/508500560/log.txt

1. The test times out because two "READY_MARKER_FILE_PREFIX" files are 
missing.

2. The two tasks that are responsible for creating these files can't be 
deployed because no resources are available.

!image-2019-07-19-17-01-19-758.png!

3. The slots from one TM (34dbf0f8264469af49be8e1dbc2ad811) are not recognized 
by the SlotManager. Because of this, the two tasks can't be deployed.

!image-2019-07-19-17-00-17-194.png!

4. The TM (34dbf0f8264469af49be8e1dbc2ad811) registers with the RM twice.

!image-2019-07-19-16-59-10-178.png!

The RM sends two RegistrationResponses to the TM. However, the TM handles 
RegistrationResponses on different threads, so the registrationId from the old 
RegistrationResponse overwrites the registrationId from the new one.

A simple fix would be to process the responses on the TM's main thread. I am 
still thinking about whether there is another approach.
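
The main-thread idea can be sketched roughly as follows. This is not Flink's 
TaskExecutor code: the class, the single-threaded executor standing in for the 
TM's main thread, and the `attempt` counter used to detect stale responses are 
all assumptions made for illustration. Responses may arrive on any thread, but 
the state is only read and written on one thread, and a response that belongs 
to an older attempt is dropped instead of overwriting the newer registrationId.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration; not the actual Flink implementation. */
final class RegistrationTracker {

    // Single thread standing in for the TM's main thread (assumption).
    private final ExecutorService mainThread =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "tm-main-thread");
                t.setDaemon(true);
                return t;
            });

    // Only accessed from mainThread, so no further synchronization is needed.
    private long latestAttempt = -1L;
    private String registrationId;

    /** May be called from any thread when a registration response arrives. */
    void onRegistrationResponse(long attempt, String newRegistrationId) {
        mainThread.execute(() -> {
            if (attempt < latestAttempt) {
                return; // stale response from an older attempt: ignore it
            }
            latestAttempt = attempt;
            registrationId = newRegistrationId;
        });
    }

    /** Reads the current id on the main thread, after all queued responses. */
    String currentRegistrationId() {
        try {
            return mainThread.submit(() -> registrationId).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }
}
```

With this shape, a late-arriving response for attempt 1 cannot replace the id 
stored for attempt 2, regardless of which thread delivers it.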


> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure is 
> unstable
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-10819
>                 URL: https://issues.apache.org/jira/browse/FLINK-10819
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.7.0
>            Reporter: sunjincheng
>            Assignee: Guowei Ma
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>         Attachments: image-2019-07-19-16-59-10-178.png, 
> image-2019-07-19-17-00-17-194.png, image-2019-07-19-17-01-19-758.png
>
>
> Found the following error during CI:
> Results :
> Tests in error: 
>  JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:331 » 
> IllegalArgument
> Tests run: 1463, Failures: 0, Errors: 1, Skipped: 29
> 18:40:55.828 [INFO] 
> ------------------------------------------------------------------------
> 18:40:55.829 [INFO] BUILD FAILURE
> 18:40:55.829 [INFO] 
> ------------------------------------------------------------------------
> 18:40:55.830 [INFO] Total time: 30:19 min
> 18:40:55.830 [INFO] Finished at: 2018-11-07T18:40:55+00:00
> 18:40:56.294 [INFO] Final Memory: 92M/678M
> 18:40:56.294 [INFO] 
> ------------------------------------------------------------------------
> 18:40:56.294 [WARNING] The requested profile "include-kinesis" could not be 
> activated because it does not exist.
> 18:40:56.295 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test 
> (integration-tests) on project flink-tests_2.11: There are test failures.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] Please refer to 
> /home/travis/build/sunjincheng121/flink/flink-tests/target/surefire-reports 
> for the individual test results.
> 18:40:56.295 [ERROR] -> [Help 1]
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] To see the full stack trace of the errors, re-run Maven 
> with the -e switch.
> 18:40:56.295 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 18:40:56.295 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> MVN exited with EXIT CODE: 1.
> Trying to KILL watchdog (11329).
> ./tools/travis_mvn_watchdog.sh: line 269: 11329 Terminated watchdog
> PRODUCED build artifacts.
> After a rerun, however, the error disappeared. 
> Currently, no specific cause has been found; I will continue to monitor it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
