[
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289154#comment-16289154
]
Adam Szita commented on HIVE-18263:
-----------------------------------
I believe the description of HIVE-15102 says exactly what happens and matches
what I have been observing, but I disagree with the proposed solution and the
conclusion as well.
I'm showing here an example of this issue through the slave with IP
35.225.218.206, which was killed unnecessarily during a job. I'm also providing
some snippets from the logs of
[PreCommit-HIVE-Build/8176|https://builds.apache.org/job/PreCommit-HIVE-Build/8176/consoleFull]
Tomcat had just been started, and the first job assigned to it was
PreCommit-HIVE-Build/8176/.
On hiveptest-server-upstream, the /data/hiveptest/hosts file contains
35.225.218.206 even right now, and it contained it on the morning of 2017-12-11
as well. Although I no longer have all the server logs from that time, we will
see that upon initialization this host was killed, and then a new one was
created with the same IP.
Jenkins started running the job at 2017-12-11T09:31:55Z, as seen at the
beginning of the logs:
{code}
....
[INFO] Copying spring-webmvc-3.2.16.RELEASE.jar to
/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/spring-webmvc-3.2.16.RELEASE.jar
[INFO] Copying velocity-1.7.jar to
/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/velocity-1.7.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.239 s
[INFO] Finished at: 2017-12-11T09:31:55Z
[INFO] Final Memory: 37M/913M
[INFO] ------------------------------------------------------------------------
+ local
'PTEST_CLASSPATH=/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/hive-ptest-3.0-classes.jar:/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/*'
+ java -cp
'/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/hive-ptest-3.0-classes.jar:/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/*'
org.apache....
{code}
Unfortunately we don't have server logs from this point, but what happens is
basically this: the server reads the hosts file, terminates any running nodes
found in it, and then creates 12 new executors.
This takes time because only 2 executors are created simultaneously, with a
wait time between batches; hence the Jenkins log shows the point at which
ptest has already created the slaves (you can see 35.225.218.206 in the
middle):
{code}
2017-12-11 09:49:35,101 INFO [TestExecutor]
LocalCommand.awaitProcessCompletion:67 Finished LocalCommandId=46.
ElapsedTime(ms)=296543
2017-12-11 09:49:35,101 DEBUG [TestExecutor] PrepPhase.execute:68 Deleting
/data/hiveptest/working/scratch/source-prep.sh: true
2017-12-11 09:49:35,110 INFO [HostExecutor 3] LocalCommand.<init>:45 Starting
LocalCommandId=48: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 104.198.217.87 'pkill -f java'
2017-12-11 09:49:35,111 INFO [HostExecutor 0] LocalCommand.<init>:45 Starting
LocalCommandId=50: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.184.147.31 'pkill -f java'
2017-12-11 09:49:35,111 INFO [HostExecutor 1] LocalCommand.<init>:45 Starting
LocalCommandId=49: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.224.37.167 'pkill -f java'
2017-12-11 09:49:35,111 INFO [HostExecutor 2] LocalCommand.<init>:45 Starting
LocalCommandId=47: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.184.14.247 'pkill -f java'
2017-12-11 09:49:35,112 INFO [HostExecutor 6] LocalCommand.<init>:45 Starting
LocalCommandId=55: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.184.205.41 'pkill -f java'
2017-12-11 09:49:35,112 INFO [HostExecutor 8] LocalCommand.<init>:45 Starting
LocalCommandId=54: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.192.223.60 'pkill -f java'
2017-12-11 09:49:35,112 INFO [HostExecutor 7] LocalCommand.<init>:45 Starting
LocalCommandId=53: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.192.216.79 'pkill -f java'
2017-12-11 09:49:35,113 INFO [HostExecutor 4] LocalCommand.<init>:45 Starting
LocalCommandId=51: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.225.218.206 'pkill -f java'
2017-12-11 09:49:35,113 INFO [HostExecutor 5] LocalCommand.<init>:45 Starting
LocalCommandId=52: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.188.119.125 'pkill -f java'
2017-12-11 09:49:35,113 INFO [HostExecutor 9] LocalCommand.<init>:45 Starting
LocalCommandId=56: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.225.33.208 'pkill -f java'
2017-12-11 09:49:35,114 INFO [HostExecutor 10] LocalCommand.<init>:45 Starting
LocalCommandId=57: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 35.224.189.104 'pkill -f java'
2017-12-11 09:49:35,119 INFO [HostExecutor 11] LocalCommand.<init>:45 Starting
LocalCommandId=58: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key -l
hiveptest 104.198.248.189 'pkill -f java'
{code}
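The initialization sequence described above (read the hosts file, terminate the listed nodes, then bring up the new fleet in small batches with a pause in between) can be sketched like this. This is a minimal illustration, not the real ptest code; the batch size, wait time, and all names here are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the staggered startup: executors are created a few at a time with
// a pause between batches, which is why roughly 18 minutes pass between the
// Jenkins job start (09:31) and the "pkill -f java" lines (09:49).
// Batch size and wait are illustrative, not the real ptest settings.
public class StaggeredCreationSketch {
    public static List<String> createExecutors(int total, int batchSize,
                                               long waitMillis) {
        List<String> created = new ArrayList<>();
        for (int i = 0; i < total; i += batchSize) {
            int end = Math.min(i + batchSize, total);
            for (int j = i; j < end; j++) {
                created.add("executor-" + j); // stand-in for a real cloud API call
            }
            if (end < total) {
                try {
                    Thread.sleep(waitMillis); // throttle between batches
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return created;
    }
}
```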
Our slave is now executing tests without a problem; the last good execution
happens at 2017-12-11 10:29:29:
{code}
2017-12-11 10:29:29,710 INFO [HostExecutor 44]
HostExecutor.executeTestBatch:261 Completed executing tests for batch
[231_TestOperationLoggingAPIWithTez] on host 35.225.218.206.
ElapsedTime(ms)=168594
{code}
The next log entry about this host shows an error:
{code}
2017-12-11 10:41:23,413 WARN [HostExecutor 45] SSHCommandExecutor.execute:67
Command exited with 255, will retry: SSHCommand [command=bash
/home/hiveptest/35.225.218.206-hiveptest-1/scratch/hiveptest-226_UTBatch_itests__hive-unit_9_tests.sh,
getHost()=35.225.218.206, getInstance()=1]
{code}
...the host was killed in the meantime. Here are the corresponding server logs:
{code}
2017-12-11 10:31:50 INFO [CloudExecutionContextProvider-BackgroundWorker]
CloudExecutionContextProvider:363 - Submitting termination for
{id=https://www.googleapis.com/compute/v1/projects/gcp-hive-upstream/zones/us-central1-a/instances/hive-ptest-slaves-c17,
....... group=hive-ptest-slaves,
imageId=https://www.googleapis.com/compute/v1/projects/gcp-hive-upstream/global/images/hiveptest-slave-debian-8-20161003,
status=RUNNING, loginPort=22, hostname=hive-ptest-slaves-c17,
privateAddresses=[10.128.0.15], publicAddresses=[35.225.218.206], ...........
2017-12-11 10:31:50 WARN [CloudExecutionContextProvider-BackgroundWorker]
CloudExecutionContextProvider:339 - Found zombie node:
{id=https://www.googleapis.com/compute/v1/projects/gcp-hive-upstream/zones/us-central1-a/instances/hive-ptest-slaves-c17,
providerId=3923746733594766760,
uri=https://www.googleapis.com/compute/v1/projects/gcp-hive-upstream/zones/us-central1-a/instances/hive-ptest-slaves-c17,
name=hive-ptest-slaves-c17, ........ group=hive-ptest-slaves,
imageId=https://www.googleapis.com/compute/v1/projects/gcp-hive-upstream/global/images/hiveptest-slave-debian-8-20161003,
status=RUNNING, loginPort=22, hostname=hive-ptest-slaves-c17,
privateAddresses=[10.128.0.15], publicAddresses=[35.225.218.206], ......
previously terminated at Mon Dec 11 09:34:27 UTC 2017
{code}
I want to highlight *previously terminated at Mon Dec 11 09:34:27 UTC 2017*.
So:
# 35.225.218.206 was in the /data/hiveptest/hosts file and was also running
when Tomcat was started at 9:30
# so upon startup hiveptest terminated it, and put its IP into mTerminatedHosts
# hiveptest started the hourly caretaker thread
# hiveptest created 12 new executors, one of them with IP 35.225.218.206
# tests were executed for an hour for job 8176
# after an hour (at 10:31) the caretaker took a look at mTerminatedHosts and
saw 35.225.218.206 in there, while the host was also running. It thought that
a previous termination had failed, so it retried the termination
# our perfectly healthy host is now terminated, making the whole job run
longer.
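The faulty decision in the last two steps can be sketched like this. This is a simplified model, not the actual CloudExecutionContextProvider code, and all names except {{mTerminatedHosts}} are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the hourly caretaker's zombie check: a host is flagged as
// a zombie purely because its IP is present in mTerminatedHosts while the
// cloud provider still reports it as RUNNING. The bug: a reused IP that now
// belongs to a brand-new, healthy slave matches exactly the same way.
public class CaretakerSketch {
    // IP -> time the termination was recorded (simplified to a String here)
    private final Map<String, String> mTerminatedHosts = new HashMap<>();

    public void recordTermination(String ip, String when) {
        mTerminatedHosts.put(ip, when);
    }

    // Called every 60 minutes with the IPs currently reported as RUNNING.
    // Any running IP still present in mTerminatedHosts is treated as a zombie
    // whose earlier termination supposedly failed, and is terminated again.
    public Set<String> findZombies(Set<String> runningIps) {
        Set<String> zombies = new HashSet<>();
        for (String ip : runningIps) {
            if (mTerminatedHosts.containsKey(ip)) {
                zombies.add(ip);
            }
        }
        return zombies;
    }
}
```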
So it is quite likely that the same IP is given to a new node.
The solution in HIVE-15102 proposes removing the IP from mTerminatedHosts as
soon as termination for it is submitted. This actually defeats the whole point
of the map, which, I think, is to keep retrying the termination of zombies
(nodes instructed to terminate but still alive) that are costly while running.
I'd rather keep this functionality intact and work around the issue by
removing the IP from mTerminatedHosts when we create a new slave with that IP.
If the creation succeeded, we can be sure there is no zombie host on that IP,
only a fresh one.
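A minimal sketch of that workaround follows. The method names are hypothetical; the real change would go into CloudExecutionContextProvider at the point where new slaves are created:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed workaround: once a new slave has successfully come up
// on some IP, drop that IP from mTerminatedHosts so the hourly caretaker will
// not mistake the fresh host for a zombie left over from an earlier
// termination. Entries for genuinely zombie IPs are left untouched.
public class CreationFixSketch {
    private final Map<String, String> mTerminatedHosts = new HashMap<>();

    public void recordTermination(String ip, String when) {
        mTerminatedHosts.put(ip, when);
    }

    // Called only after slave creation has succeeded for the given IPs.
    // Success means there cannot be a zombie on these addresses anymore.
    public void onSlavesCreated(Set<String> newSlaveIps) {
        for (String ip : newSlaveIps) {
            mTerminatedHosts.remove(ip);
        }
    }

    public boolean wouldBeKilledAsZombie(String runningIp) {
        return mTerminatedHosts.containsKey(runningIp);
    }
}
```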
> Ptest execution are multiple times slower sometimes due to dying executor
> slaves
> --------------------------------------------------------------------------------
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
> Issue Type: Bug
> Components: Testing Infrastructure
> Reporter: Adam Szita
> Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time.
> Usually it should take about 1.5 hours, but in some cases it took over 4-5
> hours.
> Looking in the logs of one such execution I've seen that some commands that
> were sent to the test-executing slaves returned 255. Here this typically
> means an unknown return code for the remote call, since hiveptest-server
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while
> running the job normally, and here is why:
> * Hive's ptest-server periodically (every 60 minutes) checks the status of
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed
> ([mTerminatedHosts
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
> contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IPs of
> hosts that were used before. (This is probably for resilience reasons.)
> ** This file is read when tomcat server starts and if any of the IPs in the
> file are seen as running slaves, ptest will terminate these first so it can
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60
> minutes and might see a running host that has the same IP as an old slave
> that was terminated at startup. It will think that this host should be
> terminated, since its IP is in {{mTerminatedHosts}} and a termination was
> already attempted 60 minutes ago.
> We have to fix this by making sure that if a new slave is created, we check
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is
> there.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)