Hongshun Wang created FLINK-32668:
-------------------------------------
Summary: fix up watchdog timeout bug in common.sh(e2e test) ?
Key: FLINK-32668
URL: https://issues.apache.org/jira/browse/FLINK-32668
Project: Flink
Issue Type: Improvement
Components: Build System / CI
Affects Versions: 1.17.1
Reporter: Hongshun Wang
Fix For: 1.17.2
Attachments: image-2023-07-25-15-27-37-441.png
When running the e2e tests, an error like this occurs:
!image-2023-07-25-15-27-37-441.png|width=733,height=115!
I then found a problem in the corresponding code:
{code:bash}
kill_test_watchdog() {
    local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
    echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
    kill $watchdog_pid
}

internal_run_with_timeout() {
    local timeout_in_seconds="$1"
    local on_failure="$2"
    local command_label="$3"
    local command="${@:4}"

    on_exit kill_test_watchdog

    (
        command_pid=$BASHPID
        (sleep "${timeout_in_seconds}" # set a timeout for this command
        echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds."
        eval "${on_failure}"
        kill "$command_pid") & watchdog_pid=$!
        echo $watchdog_pid > $TEST_DATA_DIR/job_watchdog.pid
        # invoke
        $command
    )
}
{code}
When {{$command}} completes before the timeout, the watchdog process is still
running and is killed successfully. However, when {{$command}} times out, the
watchdog process kills {{$command}} and then exits on its own. The {{on_exit}}
hook {{kill_test_watchdog}} then runs {{kill $watchdog_pid}} against a process
that no longer exists, which produces the error message shown above.
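The behavior can be reproduced outside the test framework. The following is only a minimal sketch, not code from common.sh: a background {{true}} stands in for a watchdog that has already finished, and {{stale_pid}} and {{kill_status}} are illustrative names.

```shell
#!/usr/bin/env bash
# Stand-in for a watchdog that has already fired and exited.
true &
stale_pid=$!
wait "$stale_pid"   # reap it, so the pid no longer refers to a live process

# This mirrors what kill_test_watchdog does after a timeout:
# kill complains on stderr and returns a non-zero status.
if kill "$stale_pid"; then
    kill_status=0
else
    kill_status=$?
fi
echo "kill exit status: $kill_status"
```

The message that {{kill}} prints on stderr in this situation depends on the shell, so it is not reproduced here.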
I therefore propose changing {{kill_test_watchdog}} as follows:
{code:bash}
kill_test_watchdog() {
    local watchdog_pid=$(cat $TEST_DATA_DIR/job_watchdog.pid)
    if kill -0 $watchdog_pid > /dev/null 2>&1; then
        echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
        kill $watchdog_pid
    else
        echo "watchdog (with pid=$watchdog_pid) does not exist now"
    fi
}
{code}
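One way to check the guarded version in isolation is sketched below. It rests on assumptions that are not part of the e2e scripts: a temporary directory replaces the real {{TEST_DATA_DIR}}, a background {{sleep}} or {{true}} replaces the real watchdog, and {{live_output}} and {{gone_output}} are illustrative names. Variables are quoted here as a defensive habit; the logic is the same as in the proposed change.

```shell
#!/usr/bin/env bash
# Assumption: a throwaway directory stands in for the real TEST_DATA_DIR.
TEST_DATA_DIR=$(mktemp -d)

kill_test_watchdog() {
    local watchdog_pid
    watchdog_pid=$(cat "$TEST_DATA_DIR/job_watchdog.pid")
    # kill -0 sends no signal; it only checks whether the pid can be signalled.
    if kill -0 "$watchdog_pid" > /dev/null 2>&1; then
        echo "Stopping job timeout watchdog (with pid=$watchdog_pid)"
        kill "$watchdog_pid"
    else
        echo "watchdog (with pid=$watchdog_pid) does not exist now"
    fi
}

# Case 1: the watchdog is still running (the command finished before the timeout).
sleep 30 &
echo $! > "$TEST_DATA_DIR/job_watchdog.pid"
live_output=$(kill_test_watchdog 2>&1)

# Case 2: the watchdog has already exited (the command timed out).
true &
gone_pid=$!
echo "$gone_pid" > "$TEST_DATA_DIR/job_watchdog.pid"
wait "$gone_pid"
gone_output=$(kill_test_watchdog 2>&1)

echo "$live_output"
echo "$gone_output"
rm -rf "$TEST_DATA_DIR"
```

In the first case the function should take the "Stopping job timeout watchdog" branch, and in the second the "does not exist now" branch, without {{kill}} reporting an error. The check and the subsequent {{kill}} are not atomic, so a watchdog that exits between the two calls could still trigger the original message; the window is small but not zero.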
--
This message was sent by Atlassian Jira
(v8.20.10#820010)