[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701883#comment-13701883 ]

Robert Munteanu commented on SLING-2945:

[~egli] - I've taken a look at the tests and tried to implement them using the RetryLoop, for maximum flexibility. I'm not sure if the semantics are preserved (the RetryLoop waits for each instance on each iteration, instead of waiting for all instances once at the start of the iteration). The tests consistently pass for me, even with a heavy CPU/IO process running at the same time, whereas they were failing before.

I also needed to add a try/catch to the dumpRepo calls, which failed, probably due to concurrent session access (stack trace below).

Anyway, I'd appreciate it if you could take a look at the patch and let me know if I can apply it as-is or if it needs rework.

javax.jcr.InvalidItemStateException: Item does not exist anymore: 4b3d508e-5e4c-4385-a6d4-68f07d8dce89
    at org.apache.jackrabbit.core.ItemImpl.itemSanityCheck(ItemImpl.java:116)
    at org.apache.jackrabbit.core.ItemImpl.perform(ItemImpl.java:90)
    at org.apache.jackrabbit.core.NodeImpl.getNodes(NodeImpl.java:2141)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:393)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dump(Instance.java:396)
    at org.apache.sling.discovery.impl.setup.Instance.dumpRepo(Instance.java:355)
    at org.apache.sling.discovery.impl.cluster.ClusterLoadTest.doTest(ClusterLoadTest.java:145)
    at org.apache.sling.discovery.impl.cluster.ClusterLoadTest.testSevenInstances(ClusterLoadTest.java:102)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:592)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    at org.junit.rules.Verifier$1.evaluate(Verifier.java:33)
    at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

ClusterLoadTest failures
Key: SLING-2945
URL: https://issues.apache.org/jira/browse/SLING-2945
Project: Sling
Issue Type: Bug
Components: Extensions
Reporter: Robert Munteanu
Assignee: Stefan Egli
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701899#comment-13701899 ]

Stefan Egli commented on SLING-2945:

[~rombert], excellent. That patch looks good. I might add some more asserts to the Condition there to check not only the count but also whether the correct instances are visible. But I can apply that after the patch.

Re the exception: yup, I should probably have done some more safety checks there - but the catch is fine for now.

ClusterLoadTest failures
Key: SLING-2945
URL: https://issues.apache.org/jira/browse/SLING-2945
Project: Sling
Issue Type: Bug
Components: Extensions
Reporter: Robert Munteanu
Assignee: Stefan Egli
Attachments: SLING-2945-retry-loop.patch

This test has failed recently on Jenkins at [1], with the message:

--- T E S T S ---
Running org.apache.sling.discovery.impl.cluster.ClusterLoadTest
Build timed out (after 150 minutes). Marking the build as failed.
Results :
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

To debug infra issues, I've also set up a Travis CI build at [2]. This test also fails there, but the error message is clearer:

Caused by: javax.jcr.PathNotFoundException: /var/discovery/impl

The full build log is at [3]. I also get errors when repeatedly running the test locally, so it's not just CI flakiness.

[1]: https://builds.apache.org/view/S-Z/view/Sling/job/sling-trunk-1.6/1715/console
[2]: http://travis-ci.org/rombert/sling/
[3]: https://s3.amazonaws.com/archive.travis-ci.org/jobs/8746894/log.txt

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701951#comment-13701951 ]

Robert Munteanu commented on SLING-2945:

Thanks for the review, I've committed the changes in http://svn.apache.org/viewvc?view=revision&revision=r1500670
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700515#comment-13700515 ]

Robert Munteanu commented on SLING-2945:

[~egli] - any ideas about what could be causing the failures?
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700612#comment-13700612 ]

Stefan Egli commented on SLING-2945:

My first suspicion was that the tests interfere with each other: the shutdown of the previous test could still be running while a new test starts, and that shutdown would clear the repository while it was being used/created by the new test run. To avoid this, I changed the tests so that each test run uses a separate path in the repository.

There still seems to be a problem though, which could be related to a deadlock. Following up on that one...
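The idea of giving each test run its own repository path can be sketched roughly as follows. This is only an illustration, not the change that was committed: the class `DiscoveryTestPaths`, the method `newRunRoot` and the `testrun-` prefix are hypothetical names; only the `/var/discovery` parent comes from the error message in the issue description.

```java
import java.util.UUID;

/** Hypothetical helper; not part of the Sling discovery.impl test setup. */
final class DiscoveryTestPaths {

    private DiscoveryTestPaths() {}

    /**
     * Returns a fresh repository path for one test run, so that a late
     * shutdown of an earlier run cannot clear nodes a newer run relies on.
     */
    static String newRunRoot() {
        // The random UUID makes a collision between two runs practically impossible.
        return "/var/discovery/testrun-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(newRunRoot());
        System.out.println(newRunRoot());
    }
}
```

Each test would then create and clean up only the subtree under its own root.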
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700616#comment-13700616 ]

Stefan Egli commented on SLING-2945:

Indeed a deadlock - might be only test-related though:

Found one Java-level deadlock:

Test-Heartbeat-Runner:
  waiting to lock monitor 7fb9d4882a18 (object 7f6a80980, a org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler),
  which is held by main
main:
  waiting to lock monitor 7fb9d4881710 (object 7f6bcaff8, a java.lang.Object),
  which is held by Test-Heartbeat-Runner

Java stack information for the threads listed above:

Test-Heartbeat-Runner:
    at org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler.doCheckView(HeartbeatHandler.java:454)
    - waiting to lock 7f6a80980 (a org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler)
    at org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler.checkView(HeartbeatHandler.java:381)
    at org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler.run(HeartbeatHandler.java:190)
    - locked 7f6bcaff8 (a java.lang.Object)
    at org.apache.sling.discovery.impl.setup.Instance.runHeartbeatOnce(Instance.java:310)
    at org.apache.sling.discovery.impl.setup.Instance$HeartbeatRunner.run(Instance.java:138)
    at java.lang.Thread.run(Thread.java:680)
main:
    at org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler.activate(HeartbeatHandler.java:126)
    - waiting to lock 7f6bcaff8 (a java.lang.Object)
    - locked 7f6a80980 (a org.apache.sling.discovery.impl.common.heartbeat.HeartbeatHandler)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.sling.discovery.impl.setup.OSGiMock.activate(OSGiMock.java:71)
    at org.apache.sling.discovery.impl.setup.Instance.startHeartbeats(Instance.java:322)
    at org.apache.sling.discovery.impl.cluster.ClusterLoadTest.doTest(ClusterLoadTest.java:155)
    at org.apache.sling.discovery.impl.cluster.ClusterLoadTest.testFiveInstances(ClusterLoadTest.java:82)
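The dump shows a lock-order inversion: one thread holds the plain Object monitor and wants the HeartbeatHandler monitor, while the other holds them the opposite way round. A common remedy is to make every code path take the two locks in the same order. The sketch below illustrates that technique only; `OrderedLocksHandler`, `runLock` and `runConcurrently` are hypothetical names, and this is not necessarily how the problem was resolved in Sling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical handler; loosely modelled on the dump, not on the Sling source. */
final class OrderedLocksHandler {

    private final Object runLock = new Object(); // stands in for the java.lang.Object monitor
    private boolean active;
    private int views;

    /** Startup path: takes runLock first, then the handler's own monitor. */
    void activate() {
        synchronized (runLock) {
            synchronized (this) {
                active = true;
            }
        }
    }

    /** Heartbeat path: same order (runLock, then this), so no cycle can form. */
    void runOnce() {
        synchronized (runLock) {
            synchronized (this) {
                if (active) {
                    views++;
                }
            }
        }
    }

    synchronized int views() {
        return views;
    }

    /** Runs both paths concurrently; returns true if both threads finish in time. */
    static boolean runConcurrently(OrderedLocksHandler handler, int iterations)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(2);
        Thread heartbeat = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                handler.runOnce();
            }
            done.countDown();
        }, "Test-Heartbeat-Runner");
        Thread starter = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                handler.activate();
            }
            done.countDown();
        }, "starter");
        heartbeat.setDaemon(true);
        starter.setDaemon(true);
        heartbeat.start();
        starter.start();
        return done.await(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        boolean finished = runConcurrently(new OrderedLocksHandler(), 100_000);
        System.out.println("both threads finished: " + finished);
    }
}
```

If `activate()` instead took the handler monitor first and `runLock` second, the two threads could block each other exactly as in the dump above.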
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700661#comment-13700661 ]

Robert Munteanu commented on SLING-2945:

[~egli] I see that the ClusterLoadTest still fails on Jenkins [1] and on Travis [2]. The sling-trunk-1.6 job passes now, though.

I'm going to do a 10-execution run myself and see whether they pass or fail.

    for run in $(seq 1 10); do
        # quote the message: an unquoted "#" would start a shell comment
        echo "Run #$run"
        mvn clean verify -q -pl :org.apache.sling.discovery.impl -Dtest=org.apache.sling.discovery.impl.cluster.ClusterLoadTest
    done

[1]: https://builds.apache.org/job/sling-trunk-1.7/org.apache.sling$org.apache.sling.discovery.impl/85/testReport/
[2]: https://s3.amazonaws.com/archive.travis-ci.org/jobs/8765149/log.txt
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700667#comment-13700667 ]

Stefan Egli commented on SLING-2945:

indeed - I'll have another look at the tests! :S
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700928#comment-13700928 ]

Robert Munteanu commented on SLING-2945:

At least the Travis tests are now passing [1], [2], [3], so it's definitely a sleep/delay problem. The Jenkins builds are still running.

Perhaps you can use something like org.apache.sling.testing.tools.retry.RetryLoop with a large delay, to make the wait more flexible, and also faster in case the Condition is met earlier?

[1]: https://api.travis-ci.org/jobs/8767899/log.txt?deansi=true
[2]: https://s3.amazonaws.com/archive.travis-ci.org/jobs/8767898/log.txt
[3]: https://api.travis-ci.org/jobs/8767900/log.txt?deansi=true
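The benefit of a retry loop over a fixed sleep can be shown with a small self-contained sketch. It deliberately does not use the Sling RetryLoop class, whose exact API is not shown in this thread; `PollingWait` and `waitFor` are hypothetical stand-ins, and the counter merely simulates a cluster view that becomes complete after a few checks.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a retry helper; not the Sling RetryLoop API. */
final class PollingWait {

    private PollingWait() {}

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true as soon as the condition is met, false on timeout.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long intervalMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // met early: no need to sit out the full delay
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // generous timeout exhausted
            }
            // Sleep for the polling interval, but never past the deadline.
            Thread.sleep(Math.min(intervalMillis, Math.max(1L, remainingNanos / 1_000_000L)));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated cluster view that "sees" one more instance on each check.
        AtomicInteger visibleInstances = new AtomicInteger();
        boolean ok = waitFor(() -> visibleInstances.incrementAndGet() >= 5, 5_000, 10);
        System.out.println("all instances visible: " + ok);
    }
}
```

With this shape the timeout can be generous enough for a loaded CI machine, while a fast machine returns as soon as the condition holds.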
[jira] [Commented] (SLING-2945) ClusterLoadTest failures
[ https://issues.apache.org/jira/browse/SLING-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700943#comment-13700943 ]

Stefan Egli commented on SLING-2945:

The Jenkins tests for discovery.impl ran through successfully now (#88, [0]). Keeping this ticket open to continue monitoring the Jenkins results over the next few days and to follow up on [~rombert]'s idea of using RetryLoop.

[0]: https://builds.apache.org/job/sling-trunk-1.7/88/org.apache.sling$org.apache.sling.discovery.impl/