[ 
https://issues.apache.org/jira/browse/IMPALA-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell updated IMPALA-9241:
----------------------------------
    Target Version: Impala 3.4.0

> Minicluster service status ambiguous when pids wrap around
> ----------------------------------------------------------
>
>                 Key: IMPALA-9241
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9241
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>              Labels: broken-build
>             Fix For: Impala 3.4.0
>
>
> In a recent test run, a large number of tests failed because they could not 
> contact the Kudu master. The failures looked like this:
>  
> {noformat}
> query_test/test_acid.py:26: in <module>
>     from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
> common/skip.py:108: in <module>
>     class SkipIfKudu:
> common/skip.py:112: in SkipIfKudu
>     get_kudu_master_flag("--use_hybrid_clock") == "false",
> common/kudu_test_suite.py:59: in get_kudu_master_flag
>     varz = get_kudu_master_webpage("varz")
> common/kudu_test_suite.py:55: in get_kudu_master_webpage
>     return requests.get(url).text
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:69: in get
>     return request('get', url, params=params, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:50: in request
>     response = session.request(method=method, url=url, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:465: in request
>     resp = self.send(prep, **send_kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:573: in send
>     r = adapter.send(request, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/adapters.py:415: in send
>     raise ConnectionError(err, request=request)
> E   ConnectionError: ('Connection aborted.', error(111, 'Connection refused')){noformat}
> I checked the logs/cluster/cdh6-node-1/kudu/master directory, and there was a 
> log file for the minicluster startup to do dataload, but not one for the 
> restart of the minicluster at the end of dataload 
> ([https://github.com/apache/impala/blob/fc4a91cf8c87966a910106dded7e7eb8d215270a/testdata/bin/create-load-data.sh#L717]).
>  
> Interestingly, two of the tablet servers did start up as part of that 
> restart, so the failure was not universal. In fact, quite a few tests ran 
> fine.
> My theory is that this could be due to stale PIDs. When starting up one of 
> the services covered by testdata/cluster/admin, it calls the status function 
> to see if the service is already running 
> ([https://github.com/apache/impala/blob/master/testdata/cluster/admin#L423]):
>  
> {noformat}
> if "$SCRIPT" status &>/dev/null; then  
>   RUNNING=true
> else
>   RUNNING=false
> fi{noformat}
> If it is already RUNNING, it skips the startup. The status function is 
> shared across the different services. It reads the PID from a file and 
> checks whether a process with that PID is still running:
>  
>  
> {noformat}
> function status {
>   local PID=$(read_pid)
>   if [[ -z $PID ]]; then
>     echo Not started
>     return 1
>   fi
>   if pid_exists $PID; then
>     echo Running
>   else
>     echo Not Running
>     return 1
>   fi
> }{noformat}
> However, the framework doesn't delete the PID file when a service shuts 
> down. If an unrelated process happens to be running with that PID when we 
> try to start the service, the status check reports the service as already 
> running and the startup is skipped. Silently!
> This could apply to kudu, hdfs, kms, yarn, etc. It does not apply to hbase, 
> hive, sentry, ranger, as those do not use this framework.
>  
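
The stale-PID theory above suggests two hedges: remove the PID file on shutdown, and have status confirm that the recorded PID really belongs to the service. The following is a minimal sketch of that idea, not the actual fix for this issue. PID_FILE, SERVICE_PATTERN, pid_matches_service and the bodies of read_pid, status and stop are hypothetical stand-ins written for illustration. It assumes Linux (/proc) and treats a substring match on the command line as a heuristic for "same service".

```shell
#!/bin/sh
# Illustrative sketch only: every name and default below is hypothetical
# and is not taken from testdata/cluster/admin.

PID_FILE="${PID_FILE:-/tmp/example-service.pid}"
SERVICE_PATTERN="${SERVICE_PATTERN:-example-service}"

read_pid() {
  # Print the recorded PID; print nothing if there is no PID file.
  if [ -f "$PID_FILE" ]; then
    cat "$PID_FILE"
  fi
}

pid_matches_service() {
  # Succeed only if the PID is alive AND its command line contains
  # SERVICE_PATTERN. Assumes Linux: reads /proc/<pid>/cmdline.
  local PID="$1"
  local ARGS
  [ -r "/proc/$PID/cmdline" ] || return 1
  # cmdline is NUL-separated; turn the separators into spaces.
  ARGS=$(tr '\0' ' ' < "/proc/$PID/cmdline")
  case "$ARGS" in
    *"$SERVICE_PATTERN"*) return 0 ;;
    *) return 1 ;;
  esac
}

status() {
  local PID
  PID=$(read_pid)
  if [ -z "$PID" ]; then
    echo "Not started"
    return 1
  fi
  if pid_matches_service "$PID"; then
    echo "Running"
  else
    # Stale file: the PID is dead or was recycled by another process.
    echo "Not Running"
    rm -f "$PID_FILE"
    return 1
  fi
}

stop() {
  local PID
  PID=$(read_pid)
  # Only signal the process if it still looks like our service.
  if [ -n "$PID" ] && pid_matches_service "$PID"; then
    kill "$PID"
  fi
  # Always remove the PID file so a later start cannot see a stale PID.
  rm -f "$PID_FILE"
}
```

With this shape, a PID file that points at a live but unrelated process yields "Not Running" and is cleaned up, so a subsequent start would not be skipped. The command-line match is deliberately simple; comparing the process start time recorded at launch would be a stricter alternative.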



