[
https://issues.apache.org/jira/browse/IMPALA-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joe McDonnell updated IMPALA-9241:
----------------------------------
Target Version: Impala 3.4.0
> Minicluster service status ambiguous when pids wrap around
> ----------------------------------------------------------
>
> Key: IMPALA-9241
> URL: https://issues.apache.org/jira/browse/IMPALA-9241
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 3.4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Critical
> Labels: broken-build
> Fix For: Impala 3.4.0
>
>
> In a recent test run, a large number of tests failed because they could not
> contact the Kudu master. The failures looked like this:
>
> {noformat}
> query_test/test_acid.py:26: in <module>
> from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3,
> SkipIfABFS,
> common/skip.py:108: in <module>
> class SkipIfKudu:
> common/skip.py:112: in SkipIfKudu
> get_kudu_master_flag("--use_hybrid_clock") == "false",
> common/kudu_test_suite.py:59: in get_kudu_master_flag
> varz = get_kudu_master_webpage("varz")
> common/kudu_test_suite.py:55: in get_kudu_master_webpage
> return requests.get(url).text
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:69:
> in get
> return request('get', url, params=params, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:50:
> in request
> response = session.request(method=method, url=url, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:465:
> in request
> resp = self.send(prep, **send_kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:573:
> in send
> r = adapter.send(request, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/adapters.py:415:
> in send
> raise ConnectionError(err, request=request)
> E ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))
> {noformat}
> I checked the logs/cluster/cdh6-node-1/kudu/master directory, and there was a
> log file for the minicluster startup to do dataload, but not one for the
> restart of the minicluster at the end of dataload
> ([https://github.com/apache/impala/blob/fc4a91cf8c87966a910106dded7e7eb8d215270a/testdata/bin/create-load-data.sh#L717]).
>
> Interestingly, two of the tablet servers did start up as part of that
> restart, so the failure was not universal. In fact, quite a few tests ran
> fine.
> My theory is that this could be due to stale PIDs. When
> testdata/cluster/admin starts one of the services it covers, it first calls
> the status function to see if the service is already running
> ([https://github.com/apache/impala/blob/master/testdata/cluster/admin#L423]):
>
> {noformat}
> if "$SCRIPT" status &>/dev/null; then
>   RUNNING=true
> else
>   RUNNING=false
> fi
> {noformat}
> If RUNNING is true, the admin script skips the startup. The status function
> is common across the different services: it reads the PID from a file and
> checks whether a process with that PID is still running:
>
>
> {noformat}
> function status {
>   local PID=$(read_pid)
>   if [[ -z $PID ]]; then
>     echo Not started
>     return 1
>   fi
>   if pid_exists $PID; then
>     echo Running
>   else
>     echo Not Running
>     return 1
>   fi
> }
> {noformat}
> However, a service doesn't delete its pid file when it shuts down. If PIDs
> have wrapped around and an unrelated process happens to be running with that
> PID when we try to start the service, the status check reports the service as
> already running and the startup is skipped. Silently!
> This could apply to kudu, hdfs, kms, yarn, etc. It does not apply to hbase,
> hive, sentry, ranger, as those do not use this framework.
>
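One possible way to harden the check described above, offered only as a
rough sketch: treat a live PID as "Running" only if its command line looks
like the expected service, and delete the pid file when it turns out to be
stale. The names pid_matches and checked_status, and the idea of passing the
pid file path and an expected command substring as arguments, are
assumptions made for illustration; they are not taken from
testdata/cluster/admin.

```shell
# Hypothetical helpers; names and arguments are illustrative only and are
# not part of testdata/cluster/admin.

# Succeeds only if a process with this PID exists AND its command line
# contains the expected substring.
pid_matches() {
  local pid expected cmdline
  pid=$1
  expected=$2
  # ps prints nothing and exits non-zero when the PID does not exist.
  cmdline=$(ps -p "$pid" -o args= 2>/dev/null) || return 1
  case "$cmdline" in
    *"$expected"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Like status, but a live PID only counts if it belongs to the service.
checked_status() {
  local pid_file expected pid
  pid_file=$1
  expected=$2
  pid=$(cat "$pid_file" 2>/dev/null)
  if [ -z "$pid" ]; then
    echo "Not started"
    return 1
  fi
  if pid_matches "$pid" "$expected"; then
    echo "Running"
  else
    # Stale file: the process is gone, or the PID was reused by another
    # process. Remove it so later checks do not trip over it again.
    rm -f "$pid_file"
    echo "Not Running"
    return 1
  fi
}
```

With a check along these lines, a PID that has been reused by an unrelated
process would report "Not Running", so the startup would proceed instead of
being skipped. Matching on a command-line substring is only a heuristic;
removing the pid file at shutdown would address the same problem from the
other side.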