I did some more investigation and the last few commits (especially those
on
Friday) [1] look pretty harmless. Also the time of the hangs seems quite
diverse. I’ve looked at the console output of the last 12 hangs and
copied a
few lines out that preceded the capturing of the thread dumps. I don’t
really see any patterns. In one case there seems to have a lack of space
reported by HDFS, but I’m not sure if that’s a temporary problem or
if
that’s part of the bigger problem.
I’ve attached the links to the console output and the lines I copied
and I’m
happy to take suggestions what to do next …
Thanks,
Till
[1] https://git-wip-us.apache.org/repos/asf?p=asterixdb.git
--
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/295/console
00:56:41.801 INFO: Starting [TEST]: recovery_ddl/temp_dataset_recovery
...
02:03:50.080 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/296/console
00:38:29.735 INFO: Starting [TEST]:
recover_after_abort/primary_plus_rtree_index_insert_and_delete ...
02:03:44.009 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/298/console
00:24:40.118 at java.lang.Thread.run(Thread.java:745)
02:03:55.458 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/300/console
02:03:14.569 2016-08-12 22:08:35,303 WARN
[org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@3be8821f]
namenode.NameNodeResourceChecker
(NameNodeResourceChecker.java:isResourceAvailable(89)) - Space available
on volume 'null' is 17981440, which is below the configured reserved
amount 104857600
02:03:14.569 2016-08-12 22:08:35,303 WARN
[org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@3be8821f]
namenode.FSNamesystem (FSNamesystem.java:run(4018)) - NameNode low on
available disk space. Already in safe mode.
02:03:15.833 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/302/console
00:40:08.703 SLF4J: Class path contains multiple SLF4J bindings.
00:40:08.704 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:40:08.704 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-jcl/1.6.3/slf4j-jcl-1.6.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:40:08.704 SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
02:03:00.955 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/305/console
00:26:08.302 Aug 13, 2016 3:41:28 AM
org.apache.asterix.test.aql.TestExecutor executeTest
00:26:08.302 INFO: Starting [TEST]:
query_after_restart/dataset-with-meta-record ...
02:02:34.039 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/306/console
02:02:51.189 Waiting for new AsterixDB Instance to start .
02:02:52.029 2016-08-13 05:42:00,461 INFO [Node Status Updater]
nodemanager.NodeStatusUpdaterImpl
(NodeStatusUpdaterImpl.java:getNodeStatusAndUpdateContainersInContext(314))
- Sending out status for container: container_id { app_attempt_id {
application_id { id: 1 cluster_timestamp: 1471063321301 } attemptId: 1 }
id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
02:02:52.030 2016-08-13 05:42:00,461 INFO [Node Status Updater]
nodemanager.NodeStatusUpdaterImpl
(NodeStatusUpdaterImpl.java:getNodeStatusAndUpdateContainersInContext(314))
- Sending out status for container: container_id { app_attempt_id {
application_id { id: 1 cluster_timestamp: 1471063321301 } attemptId: 1 }
id: 3 } state: C_RUNNING diagnostics: "" exit_status: -1000
02:02:52.179 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/310/console
00:54:47.092 SLF4J: Class path contains multiple SLF4J bindings.
00:54:47.093 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:54:47.093 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-jcl/1.6.3/slf4j-jcl-1.6.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:54:47.093 SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
02:02:28.914 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/312/console
00:20:29.319 Caused by:
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /Asterix/asterix/state
00:20:29.319 at
org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
00:20:29.319 at
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
00:20:29.320 at
org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
00:20:29.320 at
org.apache.asterix.event.service.ZooKeeperService.removeAsterixInstance(ZooKeeperService.java:205)
00:20:29.320 at
org.apache.asterix.event.service.ZooKeeperService.updateAsterixInstance(ZooKeeperService.java:233)
00:20:29.320 at
org.apache.asterix.installer.command.StartCommand.execCommand(StartCommand.java:68)
00:20:29.320 at
org.apache.asterix.installer.command.AbstractCommand.execute(AbstractCommand.java:38)
00:20:29.320 at
org.apache.asterix.installer.command.CommandHandler.processCommand(CommandHandler.java:86)
00:20:29.320 at
org.apache.asterix.installer.test.AsterixInstallerIntegrationUtil.executeCommand(AsterixInstallerIntegrationUtil.java:220)
00:20:29.320 ... 38 more
00:20:29.320 ...Unexpected!
00:20:29.338 Aug 13, 2016 6:53:29 PM
org.apache.asterix.installer.test.AsterixLifecycleIT
test_1_StopActiveInstance
00:20:29.338 INFO: Starting test: test_1_StopActiveInstance
00:20:43.996 Aug 13, 2016 6:53:44 PM
org.apache.asterix.installer.test.AsterixLifecycleIT
test_2_StartActiveInstance
00:20:43.996 INFO: Starting test: test_2_StartActiveInstance
00:20:57.766 Aug 13, 2016 6:53:57 PM
org.apache.asterix.installer.test.AsterixLifecycleIT
test_3_DeleteActiveInstance
00:20:57.767 INFO: Starting test: test_3_DeleteActiveInstance
00:20:58.093 Aug 13, 2016 6:53:58 PM
org.apache.asterix.installer.test.AsterixLifecycleIT
test_3_DeleteActiveInstance
00:20:58.093 INFO: PASSED: test_3_DeleteActiveInstance
02:02:29.101 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/314/console
01:00:35.717 Running
org.apache.asterix.installer.test.AsterixExternalLibraryIT
01:00:36.335 log4j:WARN No appenders could be found for logger
(org.apache.asterix.installer.command.AbstractCommand).
01:00:36.335 log4j:WARN Please initialize the log4j system properly.
01:00:40.562 SLF4J: Class path contains multiple SLF4J bindings.
01:00:40.562 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
01:00:40.562 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-jcl/1.6.3/slf4j-jcl-1.6.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
01:00:40.563 SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
02:02:46.954 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/315/console
00:30:25.774 Caused by: java.lang.Exception: INFO: Stopped Asterix
instance: nc1
00:30:25.775 INFO: Name:nc1
00:30:25.775 Created:Sun Aug 14 15:07:33 UTC 2016
00:30:25.775 Web-Url:http://127.0.0.1:19001
00:30:25.775 State:ACTIVE (Sun Aug 14 15:07:57 UTC 2016)
00:30:25.775
00:30:25.775 ERROR: KeeperErrorCode = ConnectionLoss for /Asterix/nc1
00:30:25.775
00:30:25.775 at
org.apache.asterix.test.aql.TestExecutor.executeTest(TestExecutor.java:617)
00:30:25.775 ... 42 more
00:30:25.775 ...Unexpected!
00:30:25.796 Aug 14, 2016 3:08:25 PM
org.apache.asterix.test.aql.TestExecutor executeTest
00:30:25.796 INFO: Starting [TEST]: query_after_restart/big_object_20M
...
02:02:39.073 executing script 'Capture Thread Dumps'
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/316/console
00:42:58.474 Running
org.apache.asterix.installer.test.AsterixExternalLibraryIT
00:42:59.099 log4j:WARN No appenders could be found for logger
(org.apache.asterix.installer.command.AbstractCommand).
00:42:59.099 log4j:WARN Please initialize the log4j system properly.
00:43:03.318 SLF4J: Class path contains multiple SLF4J bindings.
00:43:03.318 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:43:03.319 SLF4J: Found binding in
[jar:file:/home/jenkins/.m2/repository/org/slf4j/slf4j-jcl/1.6.3/slf4j-jcl-1.6.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
00:43:03.320 SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
02:02:43.819 executing script 'Capture Thread Dumps'
On 14 Aug 2016, at 12:05, Till Westmann wrote:
Out of the last 30 integration tests [1] 15 timed out (and 5 had test
failures).
That seems like a high number and I’m wondering if there could be a
common cause.
If anybody has some insight into one of the hangs, please share here,
so that we
can hopefully identify commonalities (if they exist ..).
Cheers,
Till
[1]
https://asterix-jenkins.ics.uci.edu/job/asterix-gerrit-integration-tests/