[jira] [Commented] (IMPALA-9241) Minicluster service status ambiguous when pids wrap around

ASF subversion and git services (Jira) Fri, 27 Dec 2019 18:18:27 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004375#comment-17004375
 ]


ASF subversion and git services commented on IMPALA-9241:
---------------------------------------------------------

Commit fa7d91fd305bac2dc5c4a145528b6cc45f5972fe in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=fa7d91f ]

IMPALA-9241: Remove pid files on successful shutdown of minicluster

The minicluster init scripts keep track of the pids for HDFS, YARN,
etc. by writing each service's pid to a file. They use the pid in the
file to decide whether a service is running and whether it needs to
be shut down or started. Currently, they do not remove the pid file
after the minicluster shuts down, so the next startup reads a stale
pid from the file to see if the service is already running. If that
pid has been reused by an unrelated process, the script concludes
the service is running and fails to start a necessary service.

This removes the pid files when the minicluster components shut down
successfully.

Change-Id: I5b14d74df8061b6595b9897df9c9667e3f569e34
Reviewed-on: http://gerrit.cloudera.org:8080/14950
Reviewed-by: Andrew Sherman <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
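
A minimal sketch of the approach the commit message describes, not the
actual change to the minicluster scripts. PID_FILE, read_pid, pid_exists
and stop are illustrative names assumed for this example, and the
10-second wait is an arbitrary choice. The pid file is removed only
after the process is confirmed gone, so a failed shutdown still leaves
the pid on record.

```shell
#!/bin/sh
# Hypothetical sketch: names and timeout are assumptions, not Impala's code.
PID_FILE="${PID_FILE:-/tmp/example-service.pid}"

# Print the recorded pid, or nothing if there is no pid file.
read_pid() {
  if [ -f "$PID_FILE" ]; then
    cat "$PID_FILE"
  fi
}

# kill -0 delivers no signal; it only checks that the pid can be signalled.
pid_exists() {
  kill -0 "$1" 2>/dev/null
}

# Stop the service and remove the pid file once the process has exited.
stop() {
  local pid i
  pid=$(read_pid)
  if [ -z "$pid" ]; then
    echo "Not started"
    return 0
  fi
  if pid_exists "$pid"; then
    kill "$pid" 2>/dev/null
    # Poll for up to about 10 seconds for the process to exit.
    i=0
    while pid_exists "$pid" && [ "$i" -lt 100 ]; do
      sleep 0.1
      i=$((i + 1))
    done
  fi
  if pid_exists "$pid"; then
    # Shutdown did not succeed: keep the pid file so the pid is not lost.
    echo "Failed to stop pid $pid; keeping $PID_FILE" >&2
    return 1
  fi
  # Successful shutdown: remove the pid file so a later status check
  # cannot match an unrelated process that reuses this pid.
  rm -f "$PID_FILE"
  echo "Stopped"
}
```

With this shape, a status check after a clean shutdown finds no pid
file and reports the service as not started, regardless of which
processes later reuse the old pid.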


> Minicluster service status ambiguous when pids wrap around
> ----------------------------------------------------------
>
>                 Key: IMPALA-9241
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9241
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Priority: Critical
>              Labels: broken-build
>
> In a recent test run, a large number of tests failed due to being unable to 
> contact the Kudu master. Messages like:
>  
> {noformat}
> query_test/test_acid.py:26: in <module>
>     from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
> common/skip.py:108: in <module>
>     class SkipIfKudu:
> common/skip.py:112: in SkipIfKudu
>     get_kudu_master_flag("--use_hybrid_clock") == "false",
> common/kudu_test_suite.py:59: in get_kudu_master_flag
>     varz = get_kudu_master_webpage("varz")
> common/kudu_test_suite.py:55: in get_kudu_master_webpage
>     return requests.get(url).text
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:69: in get
>     return request('get', url, params=params, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:50: in request
>     response = session.request(method=method, url=url, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:465: in request
>     resp = self.send(prep, **send_kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:573: in send
>     r = adapter.send(request, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/adapters.py:415: in send
>     raise ConnectionError(err, request=request)
> E   ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))
> {noformat}
> I checked the logs/cluster/cdh6-node-1/kudu/master directory, and there was a 
> log file for the minicluster startup to do dataload, but not one for the 
> restart of the minicluster at the end of dataload 
> (https://github.com/apache/impala/blob/fc4a91cf8c87966a910106dded7e7eb8d215270a/testdata/bin/create-load-data.sh#L717).
>  
> Interestingly, two of the tablet servers did start up as part of that 
> restart, so the failure was not universal. In fact, quite a few tests ran 
> fine.
> My theory is that this could be due to stale PIDs. When starting up one of 
> the services covered by testdata/cluster/admin, it calls the status function 
> to see if the service is already running 
> (https://github.com/apache/impala/blob/master/testdata/cluster/admin#L423):
>  
> {noformat}
> if "$SCRIPT" status &>/dev/null; then  
>   RUNNING=true
> else
>   RUNNING=false
> fi
> {noformat}
> If it is already RUNNING, it skips the startup. The status call is common 
> across the different services and reads the PID from a file and checks to see 
> if that PID is still running:
>  
>  
> {noformat}
> function status {
>   local PID=$(read_pid)
>   if [[ -z $PID ]]; then
>     echo Not started
>     return 1
>   fi
>   if pid_exists $PID; then
>     echo Running
>   else
>     echo Not Running
>     return 1
>   fi
> }
> {noformat}
> However, it doesn't delete the pid file when the service shuts down. If an 
> unrelated process happens to be running with that pid when we try to start 
> up, the script thinks the service is already running and silently skips 
> starting it.
> This could apply to kudu, hdfs, kms, yarn, etc. It does not apply to hbase, 
> hive, sentry, ranger, as those do not use this framework.
>  
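
The failure mode described in the issue can be reproduced in isolation.
The sketch below mirrors the decision logic of the quoted status
function. The bodies of read_pid and pid_exists are assumptions (the
issue does not show them), and PID_FILE and demo_stale_pid are
hypothetical names. Pid reuse is simulated by recording the pid of the
current shell, which is alive but is not the service.

```shell
#!/bin/sh
# Hypothetical reproduction: helper bodies are assumed, not Impala's code.
PID_FILE="${PID_FILE:-/tmp/example-service.pid}"

# Print the recorded pid, or nothing if there is no pid file.
read_pid() {
  if [ -f "$PID_FILE" ]; then
    cat "$PID_FILE"
  fi
}

# True if any process with this pid exists, whatever that process is.
pid_exists() {
  kill -0 "$1" 2>/dev/null
}

# Same decision logic as the status function quoted in the issue.
status() {
  local pid
  pid=$(read_pid)
  if [ -z "$pid" ]; then
    echo "Not started"
    return 1
  fi
  if pid_exists "$pid"; then
    echo "Running"
  else
    echo "Not Running"
    return 1
  fi
}

# Simulate a stale pid file whose pid now belongs to an unrelated process:
# $$ is this shell, which is running but is not the service.
demo_stale_pid() {
  echo "$$" > "$PID_FILE"
  status
}
```

Calling demo_stale_pid reports "Running" even though no service was
ever started, which is the condition under which the startup path
skips launching the service.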



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

