[jira] [Created] (FLINK-27756) Fix Intermittingly failing test in `AsyncSinkWriterTest`
Ahmed Hamdy created FLINK-27756:
---

Summary: Fix Intermittingly failing test in `AsyncSinkWriterTest`
Key: FLINK-27756
URL: https://issues.apache.org/jira/browse/FLINK-27756
Project: Flink
Issue Type: Sub-task
Components: Connectors / Kinesis
Affects Versions: 1.15.0
Reporter: Ahmed Hamdy
Assignee: Ahmed Hamdy
Fix For: 1.15.0

h2. Motivation
- Add documentation for the kinesis firehose table api feature.
- Add user guide and configuration list.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
Re: Failing Test
Thanks, the actual problem is that the ActorSystem gets shut down. This breaks the testing code. Should be fixed once https://github.com/apache/flink/pull/1852 is merged.

On Tue, Apr 5, 2016 at 12:25 PM, Matthias J. Sax wrote:
> Happened again after your fix:
> https://travis-ci.org/apache/flink/jobs/120620482
>
> -Matthias
>
> On 04/01/2016 08:57 PM, Maximilian Michels wrote:
>> Fixed with the resolution of
>> https://issues.apache.org/jira/browse/FLINK-3689.
>>
>> On Fri, Apr 1, 2016 at 12:40 PM, Maximilian Michels wrote:
>>> Hi Matthias,
>>>
>>> Thanks for spotting the test failure. It's actually a bug in the code
>>> and not a test problem. Fixing it.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Fri, Apr 1, 2016 at 9:33 AM, Ufuk Celebi wrote:
>>>> Hey Matthias,
>>>>
>>>> the test has been only recently added with the resource management
>>>> refactoring. It's probably just a too aggressive timeout for Travis.
>>>>
>>>> @Max: Did you ever see this fail?
>>>>
>>>> – Ufuk
>>>>
>>>> On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
>>>>> Anyone seen this before? One-time thing or test instability?
>>>>>
>>>>>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>>>>>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>>>>>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>>>>>
>>>>> -Matthias
Re: Failing Test
Happened again after your fix:
https://travis-ci.org/apache/flink/jobs/120620482

-Matthias

On 04/01/2016 08:57 PM, Maximilian Michels wrote:
> Fixed with the resolution of https://issues.apache.org/jira/browse/FLINK-3689.
>
> On Fri, Apr 1, 2016 at 12:40 PM, Maximilian Michels wrote:
>> Hi Matthias,
>>
>> Thanks for spotting the test failure. It's actually a bug in the code
>> and not a test problem. Fixing it.
>>
>> Cheers,
>> Max
>>
>> On Fri, Apr 1, 2016 at 9:33 AM, Ufuk Celebi wrote:
>>> Hey Matthias,
>>>
>>> the test has been only recently added with the resource management
>>> refactoring. It's probably just a too aggressive timeout for Travis.
>>>
>>> @Max: Did you ever see this fail?
>>>
>>> – Ufuk
>>>
>>> On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
>>>> Anyone seen this before? One-time thing or test instability?
>>>>
>>>>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>>>>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>>>>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>>>>
>>>> -Matthias
Re: Failing Test
Thanks. Just tried it out and it works :)

On 04/01/2016 08:57 PM, Maximilian Michels wrote:
> Fixed with the resolution of https://issues.apache.org/jira/browse/FLINK-3689.
>
> On Fri, Apr 1, 2016 at 12:40 PM, Maximilian Michels wrote:
>> Hi Matthias,
>>
>> Thanks for spotting the test failure. It's actually a bug in the code
>> and not a test problem. Fixing it.
>>
>> Cheers,
>> Max
>>
>> On Fri, Apr 1, 2016 at 9:33 AM, Ufuk Celebi wrote:
>>> Hey Matthias,
>>>
>>> the test has been only recently added with the resource management
>>> refactoring. It's probably just a too aggressive timeout for Travis.
>>>
>>> @Max: Did you ever see this fail?
>>>
>>> – Ufuk
>>>
>>> On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
>>>> Anyone seen this before? One-time thing or test instability?
>>>>
>>>>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>>>>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>>>>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>>>>
>>>> -Matthias
Re: Failing Test
Fixed with the resolution of https://issues.apache.org/jira/browse/FLINK-3689.

On Fri, Apr 1, 2016 at 12:40 PM, Maximilian Michels wrote:
> Hi Matthias,
>
> Thanks for spotting the test failure. It's actually a bug in the code
> and not a test problem. Fixing it.
>
> Cheers,
> Max
>
> On Fri, Apr 1, 2016 at 9:33 AM, Ufuk Celebi wrote:
>> Hey Matthias,
>>
>> the test has been only recently added with the resource management
>> refactoring. It's probably just a too aggressive timeout for Travis.
>>
>> @Max: Did you ever see this fail?
>>
>> – Ufuk
>>
>> On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
>>> Anyone seen this before? One-time thing or test instability?
>>>
>>>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>>>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>>>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>>>
>>> -Matthias
Re: Failing Test
Hi Matthias,

Thanks for spotting the test failure. It's actually a bug in the code and not a test problem. Fixing it.

Cheers,
Max

On Fri, Apr 1, 2016 at 9:33 AM, Ufuk Celebi wrote:
> Hey Matthias,
>
> the test has been only recently added with the resource management
> refactoring. It's probably just a too aggressive timeout for Travis.
>
> @Max: Did you ever see this fail?
>
> – Ufuk
>
> On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
>> Anyone seen this before? One-time thing or test instability?
>>
>>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>>
>> -Matthias
Re: Failing Test
Hey Matthias,

the test has been only recently added with the resource management refactoring. It's probably just a too aggressive timeout for Travis.

@Max: Did you ever see this fail?

– Ufuk

On Fri, Apr 1, 2016 at 9:24 AM, Matthias J. Sax wrote:
> Anyone seen this before? One-time thing or test instability?
>
>> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
>> (29848225634 nanoseconds) during expectMsgClass waiting for class
>> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful
>
> -Matthias
Failing Test
Anyone seen this before? One-time thing or test instability?

> ClusterShutdownITCase.testClusterShutdown:71 assertion failed: timeout
> (29848225634 nanoseconds) during expectMsgClass waiting for class
> org.apache.flink.runtime.clusterframework.messages.StopClusterSuccessful

-Matthias
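For readers scanning the assertion message: the reported timeout is given in nanoseconds, which is hard to judge at a glance. The following is only a small conversion sketch; the class and method names are made up for illustration and are not part of Flink or the test.

```java
import java.time.Duration;

final class TimeoutNanosSketch {
    // Hypothetical helper: converts a nanosecond count to whole milliseconds (truncating).
    static long toMillis(long nanos) {
        return Duration.ofNanos(nanos).toMillis();
    }

    public static void main(String[] args) {
        // Value copied from the assertion message in the report above.
        System.out.println(toMillis(29_848_225_634L) + " ms"); // prints "29848 ms"
    }
}
```

So the test waited a little under 30 seconds for the `StopClusterSuccessful` message before the assertion failed.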
[jira] [Created] (FLINK-2839) Failing test: OperatorStatsAccumulatorTest.testAccumulatorAllStatistics
Gabor Gevay created FLINK-2839:
--

Summary: Failing test: OperatorStatsAccumulatorTest.testAccumulatorAllStatistics
Key: FLINK-2839
URL: https://issues.apache.org/jira/browse/FLINK-2839
Project: Flink
Issue Type: Bug
Components: flink-contrib
Reporter: Gabor Gevay
Priority: Minor

I saw this test failure:
{code}
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.633 sec <<< FAILURE! - in org.apache.flink.contrib.operatorstatistics.OperatorStatsAccumulatorTest
testAccumulatorAllStatistics(org.apache.flink.contrib.operatorstatistics.OperatorStatsAccumulatorTest) Time elapsed: 1.5 sec <<< FAILURE!
java.lang.AssertionError: The total number of heavy hitters should be between 0 and 5.
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.flink.contrib.operatorstatistics.OperatorStatsAccumulatorTest.testAccumulatorAllStatistics(OperatorStatsAccumulatorTest.java:151)
{code}
Full log [here|https://s3.amazonaws.com/archive.travis-ci.org/jobs/84469788/log.txt].

Maybe the test should set a constant seed to the {{Random}} object.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
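The suggestion in FLINK-2839 to use a constant seed can be sketched as follows. This is a minimal illustration, not the actual test code: `SeededRandomSketch` and `draw` are hypothetical names, and the seed value 42 is arbitrary. The only fact relied on is that `java.util.Random` produces the same sequence for the same seed.

```java
import java.util.Arrays;
import java.util.Random;

final class SeededRandomSketch {
    // Hypothetical helper: draws `count` ints in [0, bound) from a Random
    // constructed with a fixed seed, so every run sees the same input data.
    static int[] draw(long seed, int count, int bound) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(bound);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] first = draw(42L, 5, 100);
        int[] second = draw(42L, 5, 100);
        // Same seed, same sequence: a test built on this input is repeatable.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

The trade-off is that a fixed seed makes the test deterministic but also stops it from exploring new inputs, so a seed that happens to pass can hide a real problem.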
[jira] [Created] (FLINK-2832) Failing test: RandomSamplerTest.testReservoirSamplerWithReplacement
Vasia Kalavri created FLINK-2832:

Summary: Failing test: RandomSamplerTest.testReservoirSamplerWithReplacement
Key: FLINK-2832
URL: https://issues.apache.org/jira/browse/FLINK-2832
Project: Flink
Issue Type: Bug
Components: Tests
Affects Versions: 0.10
Reporter: Vasia Kalavri
Priority: Critical
Fix For: 0.10

Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 19.133 sec <<< FAILURE! - in org.apache.flink.api.java.sampling.RandomSamplerTest
testReservoirSamplerWithReplacement(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 2.534 sec <<< FAILURE!
java.lang.AssertionError: KS test result with p value(0.11), d value(0.103090)
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyKSTest(RandomSamplerTest.java:342)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyRandomSamplerWithSampleSize(RandomSamplerTest.java:330)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyReservoirSamplerWithReplacement(RandomSamplerTest.java:289)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.testReservoirSamplerWithReplacement(RandomSamplerTest.java:192)

Results :

Failed tests:
  RandomSamplerTest.testReservoirSamplerWithReplacement:192->verifyReservoirSamplerWithReplacement:289->verifyRandomSamplerWithSampleSize:330->verifyKSTest:342 KS test result with p value(0.11), d value(0.103090)

Full log [here|https://travis-ci.org/apache/flink/jobs/84120131].
Re: Failing test
If there is none yet, then we do. Label it with "test-stability". I think the consensus was also to mark it as critical. Otherwise, just add the log to the JIRA.

On Tue, Oct 6, 2015 at 2:57 PM, Matthias J. Sax wrote:
> Hi,
>
> One test just failed on current master:
> https://travis-ci.org/apache/flink/jobs/83871008
>
> Do we need a JIRA?
>
> LeaderChangeStateCleanupTest.testReelectionOfSameJobManager:245 »
> Timeout Futu...
>
> -Matthias
Failing test
Hi,

One test just failed on current master:
https://travis-ci.org/apache/flink/jobs/83871008

Do we need a JIRA?

> LeaderChangeStateCleanupTest.testReelectionOfSameJobManager:245 » Timeout
> Futu...

-Matthias
[jira] [Created] (FLINK-2628) Failing Test: StreamFaultToleranceTestBase.runCheckpointedProgram
Martin Liesenberg created FLINK-2628:

Summary: Failing Test: StreamFaultToleranceTestBase.runCheckpointedProgram
Key: FLINK-2628
URL: https://issues.apache.org/jira/browse/FLINK-2628
Project: Flink
Issue Type: Bug
Components: Tests
Reporter: Martin Liesenberg

In pull request #1097, the test StreamFaultToleranceTestBase.runCheckpointedProgram failed. The changes introduced in the pull request are most likely unrelated. I cannot reproduce it locally.
Re: Failing Test: KafkaITCase and KafkaProducerITCase
I have a patch pending that should help with these timeout issues (and null checks)...

On Mon, Sep 7, 2015 at 2:41 PM, Matthias J. Sax wrote:
> Please look here:
> https://travis-ci.org/apache/flink/jobs/79086396
>
> Failed tests:
> KafkaITCase>KafkaTestBase.prepare:155 Test setup failed: Unable to connect to zookeeper server within timeout: 6000
> KafkaProducerITCase>KafkaTestBase.prepare:155 Test setup failed: Unable to connect to zookeeper server within timeout: 6000
>
> Tests in error:
> KafkaITCase>KafkaTestBase.shutDownServices:196 » NullPointer
> KafkaProducerITCase>KafkaTestBase.shutDownServices:196 » NullPointer
>
> I did not find any JIRA for it.
>
> -Matthias
[jira] [Created] (FLINK-2599) Failing Test: SlotCountExceedingParallelismTest
Matthias J. Sax created FLINK-2599:
--

Summary: Failing Test: SlotCountExceedingParallelismTest
Key: FLINK-2599
URL: https://issues.apache.org/jira/browse/FLINK-2599
Project: Flink
Issue Type: Bug
Components: Tests
Reporter: Matthias J. Sax
Priority: Critical

{noformat}
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 200.355 sec <<< FAILURE! - in org.apache.flink.runtime.jobmanager.SlotCountExceedingParallelismTest
org.apache.flink.runtime.jobmanager.SlotCountExceedingParallelismTest Time elapsed: 200.355 sec <<< ERROR!
java.util.concurrent.TimeoutException: Futures timed out after [20 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:95)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:95)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.ready(package.scala:95)
    at org.apache.flink.runtime.minicluster.FlinkMiniCluster.waitForTaskManagersToBeRegistered(FlinkMiniCluster.scala:237)
    at org.apache.flink.runtime.minicluster.FlinkMiniCluster.<init>(FlinkMiniCluster.scala:95)
    at org.apache.flink.runtime.testingUtils.TestingCluster.<init>(TestingCluster.scala:43)
    at org.apache.flink.runtime.testingUtils.TestingCluster.<init>(TestingCluster.scala:51)
    at org.apache.flink.runtime.testingUtils.TestingCluster.<init>(TestingCluster.scala:56)
    at org.apache.flink.runtime.testingUtils.TestingUtils$.startTestingCluster(TestingUtils.scala:65)
    at org.apache.flink.runtime.testingUtils.TestingUtils.startTestingCluster(TestingUtils.scala)
    at org.apache.flink.runtime.jobmanager.SlotCountExceedingParallelismTest.setUp(SlotCountExceedingParallelismTest.java:49)

org.apache.flink.runtime.jobmanager.SlotCountExceedingParallelismTest Time elapsed: 200.355 sec <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.flink.runtime.jobmanager.SlotCountExceedingParallelismTest.tearDown(SlotCountExceedingParallelismTest.java:57)
{noformat}

https://travis-ci.org/apache/flink/jobs/77887433
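The second error in FLINK-2599 looks like a follow-on of the first: if `setUp` throws before the cluster field is assigned, `tearDown` then dereferences `null`. That reading is an inference from the two stack traces, not something the ticket states. A minimal sketch of a guarded tear-down, using a made-up `FakeCluster` stand-in rather than any Flink class:

```java
final class GuardedTearDownSketch {
    /** Hypothetical stand-in for a test cluster; not a Flink class. */
    static final class FakeCluster {
        boolean stopped;

        void stop() {
            stopped = true;
        }
    }

    // Stays null if setUp fails before the assignment below.
    private FakeCluster cluster;

    void setUp(boolean startupSucceeds) {
        if (!startupSucceeds) {
            // Simulates the startup timeout seen in the first error.
            throw new IllegalStateException("simulated startup timeout");
        }
        cluster = new FakeCluster();
    }

    void tearDown() {
        // Guard against a failed setUp so the original error is not
        // masked by a secondary NullPointerException.
        if (cluster != null) {
            cluster.stop();
            cluster = null;
        }
    }
}
```

With the guard in place the report would show only the startup timeout, which is the failure that actually needs investigating.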
[jira] [Created] (FLINK-2596) Failing Test: RandomSamplerTest
Matthias J. Sax created FLINK-2596:
--

Summary: Failing Test: RandomSamplerTest
Key: FLINK-2596
URL: https://issues.apache.org/jira/browse/FLINK-2596
Project: Flink
Issue Type: Bug
Reporter: Matthias J. Sax
Priority: Critical

{noformat}
Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 14.925 sec <<< FAILURE! - in org.apache.flink.api.java.sampling.RandomSamplerTest
testReservoirSamplerWithMultiSourcePartitions2(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 0.444 sec <<< ERROR!
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeLo(TimSort.java:747)
    at java.util.TimSort.mergeAt(TimSort.java:483)
    at java.util.TimSort.mergeCollapse(TimSort.java:410)
    at java.util.TimSort.sort(TimSort.java:214)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at java.util.Collections.sort(Collections.java:217)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.transferFromListToArrayWithOrder(RandomSamplerTest.java:375)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.getSampledOutput(RandomSamplerTest.java:367)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyKSTest(RandomSamplerTest.java:338)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyRandomSamplerWithSampleSize(RandomSamplerTest.java:330)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifyReservoirSamplerWithReplacement(RandomSamplerTest.java:290)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.testReservoirSamplerWithMultiSourcePartitions2(RandomSamplerTest.java:212)

Results :

Tests in error:
  RandomSamplerTest.testReservoirSamplerWithMultiSourcePartitions2:212->verifyReservoirSamplerWithReplacement:290->verifyRandomSamplerWithSampleSize:330->verifyKSTest:338->getSampledOutput:367->transferFromListToArrayWithOrder:375 » IllegalArgument
{noformat}

https://travis-ci.org/apache/flink/jobs/77750329
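FLINK-2596 does not say which comparator is at fault, so the following is only an illustration of the general failure mode behind "Comparison method violates its general contract!". It assumes, purely for the sake of example, a comparator on `Double` values that never returns 0; the names `BROKEN`, `CONSISTENT`, and `sortedCopy` are hypothetical. TimSort may throw this exception when a comparator is not consistent, for instance when `compare(a, b)` and `compare(b, a)` do not have opposite signs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class ComparatorContractSketch {
    // Broken on purpose: never returns 0, so for equal values both
    // compare(a, b) and compare(b, a) return 1, which violates the contract.
    static final Comparator<Double> BROKEN = (a, b) -> a < b ? -1 : 1;

    // Consistent alternative from the JDK: returns 0 for equal values.
    static final Comparator<Double> CONSISTENT = Double::compare;

    // Hypothetical helper: returns a sorted copy using the consistent comparator.
    static List<Double> sortedCopy(List<Double> values) {
        List<Double> copy = new ArrayList<>(values);
        Collections.sort(copy, CONSISTENT);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortedCopy(Arrays.asList(3.0, 1.0, 2.0, 1.0))); // prints "[1.0, 1.0, 2.0, 3.0]"
    }
}
```

Because the exception depends on the data and on which merge paths TimSort takes, a broken comparator can pass many runs before failing, which fits the intermittent nature of the report.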
Re: [FAILING TEST] StateCheckpoinedITCase
Pushed a fix for the StateCheckpointedITCase

On Mon, Aug 24, 2015 at 12:19 PM, Maximilian Michels m...@apache.org wrote:
+1 for labeling the JIRAs with test-stability.

On Sat, Aug 22, 2015 at 8:21 PM, Márton Balassi balassi.mar...@gmail.com wrote:
+1 for Vasia's suggestion

On Aug 22, 2015 8:07 PM, Vasiliki Kalavri vasilikikala...@gmail.com wrote:
I just came across 2 more :/ I'm also in favor of tracking these with JIRA. How about test-stability for a label?
-V.

On 21 August 2015 at 12:47, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
+1 for a test-stability label and labeling these issues as critical

On Mon, Aug 24, 2015 at 6:31 PM, Stephan Ewen se...@apache.org wrote:
Pushed a fix for the StateCheckpointedITCase

On Mon, Aug 24, 2015 at 12:19 PM, Maximilian Michels m...@apache.org wrote:
+1 for labeling the JIRAs with test-stability.

On Sat, Aug 22, 2015 at 8:21 PM, Márton Balassi balassi.mar...@gmail.com wrote:
+1 for Vasia's suggestion

On Aug 22, 2015 8:07 PM, Vasiliki Kalavri vasilikikala...@gmail.com wrote:
I just came across 2 more :/ I'm also in favor of tracking these with JIRA. How about test-stability for a label?
-V.

On 21 August 2015 at 12:47, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
+1 for labeling the JIRAs with test-stability.

On Sat, Aug 22, 2015 at 8:21 PM, Márton Balassi balassi.mar...@gmail.com wrote:
+1 for Vasia's suggestion

On Aug 22, 2015 8:07 PM, Vasiliki Kalavri vasilikikala...@gmail.com wrote:
I just came across 2 more :/ I'm also in favor of tracking these with JIRA. How about test-stability for a label?
-V.

On 21 August 2015 at 12:47, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] RandomSamplerTest
Hi Matthias,

Thanks for reporting. The label test-stability exists now.

Cheers,
Max

On Sun, Aug 23, 2015 at 12:32 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Hi,

because there is (not yet) a label for failing tests, I just report it over the mailing list again. (I also open a JIRA for it.)

Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.943 sec <<< FAILURE! - in org.apache.flink.api.java.sampling.
testPoissonSamplerFraction(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 0.017 sec <<< FAILURE!
java.lang.AssertionError: expected fraction: 0.01, result fraction: 0.011300
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifySamplerFraction(RandomSamplerTest.java:249)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.testPoissonSamplerFraction(RandomSamplerTest.java:116)

Results :

Failed tests: Successfully installed excon-0.33.0
  RandomSamplerTest.testPoissonSamplerFraction:116->verifySamplerFraction:249 expected fraction: 0.01, result fraction: 0.011300

https://travis-ci.org/apache/flink/jobs/76720572

-Matthias
[jira] [Created] (FLINK-2564) Failing Test: RandomSamplerTest
Matthias J. Sax created FLINK-2564:
--

Summary: Failing Test: RandomSamplerTest
Key: FLINK-2564
URL: https://issues.apache.org/jira/browse/FLINK-2564
Project: Flink
Issue Type: Bug
Reporter: Matthias J. Sax

{noformat}
Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.943 sec <<< FAILURE! - in org.apache.flink.api.java.sampling.
testPoissonSamplerFraction(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 0.017 sec <<< FAILURE!
java.lang.AssertionError: expected fraction: 0.01, result fraction: 0.011300
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifySamplerFraction(RandomSamplerTest.java:249)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.testPoissonSamplerFraction(RandomSamplerTest.java:116)

Results :

Failed tests: Successfully installed excon-0.33.0
  RandomSamplerTest.testPoissonSamplerFraction:116->verifySamplerFraction:249 expected fraction: 0.01, result fraction: 0.011300
{noformat}

Full log: https://travis-ci.org/apache/flink/jobs/76720572
[FAILING TEST] RandomSamplerTest
Hi,

because there is (not yet) a label for failing tests, I just report it over the mailing list again. (I also open a JIRA for it.)

Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.943 sec <<< FAILURE! - in org.apache.flink.api.java.sampling.
testPoissonSamplerFraction(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 0.017 sec <<< FAILURE!
java.lang.AssertionError: expected fraction: 0.01, result fraction: 0.011300
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.verifySamplerFraction(RandomSamplerTest.java:249)
    at org.apache.flink.api.java.sampling.RandomSamplerTest.testPoissonSamplerFraction(RandomSamplerTest.java:116)

Results :

Failed tests: Successfully installed excon-0.33.0
  RandomSamplerTest.testPoissonSamplerFraction:116->verifySamplerFraction:249 expected fraction: 0.01, result fraction: 0.011300

https://travis-ci.org/apache/flink/jobs/76720572

-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
I just came across 2 more :/ I'm also in favor of tracking these with JIRA. How about test-stability for a label?
-V.

On 21 August 2015 at 12:47, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
+1 for Vasia's suggestion

On Aug 22, 2015 8:07 PM, Vasiliki Kalavri vasilikikala...@gmail.com wrote:
I just came across 2 more :/ I'm also in favor of tracking these with JIRA. How about test-stability for a label?
-V.

On 21 August 2015 at 12:47, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
I like the idea with the special label. Otherwise, it will be difficult to find the correct tickets.
-Matthias

On 08/21/2015 12:15 PM, Till Rohrmann wrote:
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently instable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my gmail inbox to see whether somebody else wrote something about the error already. Creating a JIRA for each failing test might be a better approach. Because that's what bugtrackers are made for ;) (And the issues still pop up when doing a gmail search)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Thanks for the info. Over the weeks I lost track which errors/failing/instable tests are known and which not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
I also see it from time to time. I have a wip branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
Error message is:
Failed tests: StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
See: https://travis-ci.org/mjsax/flink/jobs/76483093
-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my Gmail inbox to see whether somebody else has already written something about the error. Creating a JIRA issue for each failing test might be a better approach, because that's what bug trackers are made for ;) (And the issues still pop up when doing a Gmail search.)

On Thu, Aug 20, 2015 at 10:16 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Thanks for the info. Over the weeks I lost track of which errors and failing/unstable tests are known and which are not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?
>
> -Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
I'm also in favor of JIRA, because I fear that nobody will keep the wiki page in sync. Maybe we can assign a special label for test stability to these JIRA issues. Then we can quickly find all currently unstable test cases.

On Fri, Aug 21, 2015 at 11:02 AM, Robert Metzger rmetz...@apache.org wrote:
> I agree that we should look for a solution other than opening a lot of small discussion threads on the mailing list. When I have a test failure, I usually search my Gmail inbox to see whether somebody else has already written something about the error. Creating a JIRA issue for each failing test might be a better approach, because that's what bug trackers are made for ;) (And the issues still pop up when doing a Gmail search.)
[FAILING TEST] StateCheckpoinedITCase
The error message is:

Failed tests:
  StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint

See: https://travis-ci.org/mjsax/flink/jobs/76483093

-Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468

I also see it from time to time. I have a WIP branch where I relaxed the constraints for the test to pass a bit.

On Thu, Aug 20, 2015 at 10:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> The error message is:
>
> Failed tests:
>   StateCheckpoinedITCase>StreamFaultToleranceTestBase.runCheckpointedProgram:103->postSubmit:98 Test inconclusive: failure occurred before first checkpoint
>
> See: https://travis-ci.org/mjsax/flink/jobs/76483093
>
> -Matthias
Re: [FAILING TEST] StateCheckpoinedITCase
Thanks for the info. Over the weeks I lost track of which errors and failing/unstable tests are known and which are not. Should we start a wiki page or similar to collect known errors? If a test fails on a known error, it can just be ignored. This would avoid spam on the mailing list. Any thoughts about this?

-Matthias

On 08/20/2015 10:08 PM, Robert Metzger wrote:
> Sachin saw the error as well, as reported here: https://issues.apache.org/jira/browse/FLINK-2468
>
> I also see it from time to time. I have a WIP branch where I relaxed the constraints for the test to pass a bit.
Re: [FAILING TEST] BlobLibraryCacheManagerTest
Looks like a rare race between the cleanup (two changes) and the test validating both changes. I'll push a fix to make the test more reliable.

On Sun, Aug 16, 2015 at 11:04 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Hi,
>
> I hit a failing test in flink-runtime. Not sure if it is known already:
>
> Failed tests:
>   CheckpointCoordinatorTest.testCheckpointTimeoutIsolated:594 expected:<0> but was:<1>
>
> Please see: https://travis-ci.org/mjsax/flink/jobs/75847501
>
> -Matthias
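The kind of race described above (a test asserting on state that an asynchronous cleanup has not finished updating yet) is commonly made more reliable by polling with a deadline instead of asserting immediately. The following is only a minimal sketch of that general technique, not the fix that was pushed; the class name `AwaitCondition`, the helper `await`, and the `pending` counter are hypothetical and do not come from the Flink code base.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitCondition {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interrupt.
     */
    public static boolean await(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected state
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical counter standing in for "entries not yet cleaned up".
        final AtomicInteger pending = new AtomicInteger(1);

        // Simulated asynchronous cleanup that finishes a little later.
        Thread cleanup = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            pending.set(0);
        });
        cleanup.start();

        // Asserting pending == 0 right here could race with the cleanup thread;
        // polling with a deadline tolerates the delay.
        boolean cleaned = await(() -> pending.get() == 0, 2000, 10);
        cleanup.join();
        System.out.println("cleaned up in time: " + cleaned);
    }
}
```

The trade-off is that a genuinely broken cleanup now fails only after the full timeout, so the deadline should be generous for slow CI machines but still well below any build-level watchdog.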
[FAILING TEST] BlobLibraryCacheManagerTest
Hi,

I hit a failing test in flink-runtime. Not sure if it is known already:

Failed tests:
  CheckpointCoordinatorTest.testCheckpointTimeoutIsolated:594 expected:<0> but was:<1>

Please see: https://travis-ci.org/mjsax/flink/jobs/75847501

-Matthias
Re: Failing test in Gelly
Maybe it is an issue with the embedded YARN mini cluster...

On Mon, Aug 10, 2015 at 8:37 PM, Stephan Ewen se...@apache.org wrote:
> I think the YARN problem is the same as before, but with a longer timeout. Before, the tests aborted when the expected output had not arrived after 60 seconds. The timeout is now 180 seconds, which is probably so long that the deadlock detector (5 minutes without output) kicks in.
>
> In any case, something is broken, because the YARN program does not finish properly.
Re: Failing test in Gelly
I think the YARN problem is the same as before, but with a longer timeout. Before, the tests aborted when the expected output had not arrived after 60 seconds. The timeout is now 180 seconds, which is probably so long that the deadlock detector (5 minutes without output) kicks in.

In any case, something is broken, because the YARN program does not finish properly.

On Sun, Aug 9, 2015 at 9:49 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Not sure about the YARN test... As YARN was unstable all the time, I just ignored it...
>
> -Matthias
>
> On 08/09/2015 09:38 PM, Ufuk Celebi wrote:
>> PS: What about the YARN test case... Is that one known (with that trace)?
>>
>> On Sunday, August 9, 2015, Ufuk Celebi u...@apache.org wrote:
>>> There is an issue for this from last week. I couldn't look into it last week, will do tomorrow. Thanks for the logs. :)
>>>
>>> On Sunday, August 9, 2015, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
>>>> Wrong link... sorry. https://travis-ci.org/mjsax/flink/jobs/74787655
>>>>
>>>> On 08/09/2015 04:02 PM, Maximilian Michels wrote:
>>>>> Hi Matthias,
>>>>>
>>>>> Is that the correct build URL? I can't spot any failing Gelly tests. The build appears to be stuck in the YARNSessionFIFOITCase.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On Sun, Aug 9, 2015 at 3:37 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
>>>>>> Hi, I got a new failing test in this build (flink-gelly) https://travis-ci.org/mjsax/flink/jobs/74787658
>>>>>>
>>>>>> The branch is basically the current master, as I only fixed documentation stuff in this PR.
>>>>>>
>>>>>> -Matthias
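The timeout reasoning in Stephan's message can be made concrete with a small calculation. This is a sketch under stated assumptions, not Flink or Travis code: it assumes the watchdog fires after 5 minutes without output and that the test produces no output while it waits. The class and method names are hypothetical. It computes how long the build may already have been silent before the wait starts without the watchdog firing first.

```java
import java.time.Duration;

public class WatchdogMargin {

    /**
     * Silence the build can tolerate before the wait begins, assuming the
     * test itself stays silent for its whole timeout. Never negative.
     */
    public static Duration allowedSilenceBeforeWait(Duration watchdogLimit, Duration testTimeout) {
        Duration margin = watchdogLimit.minus(testTimeout);
        return margin.isNegative() ? Duration.ZERO : margin;
    }

    public static void main(String[] args) {
        Duration watchdog = Duration.ofMinutes(5); // "5 minutes no output"

        // Old timeout: 60 seconds, new timeout: 180 seconds.
        long oldMargin = allowedSilenceBeforeWait(watchdog, Duration.ofSeconds(60)).getSeconds();
        long newMargin = allowedSilenceBeforeWait(watchdog, Duration.ofSeconds(180)).getSeconds();

        System.out.println("margin with 60s timeout:  " + oldMargin + "s");  // 240s
        System.out.println("margin with 180s timeout: " + newMargin + "s"); // 120s
    }
}
```

Under these assumptions the margin shrinks from 240 to 120 seconds, so a hanging program is more likely to be killed by the watchdog than to be reported by the test's own timeout.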
Failing test in Gelly
Hi,

I got a new failing test in this build (flink-gelly): https://travis-ci.org/mjsax/flink/jobs/74787658

The branch is basically the current master, as I only fixed documentation stuff in this PR.

-Matthias
Re: Failing Test again
I've assigned https://issues.apache.org/jira/browse/FLINK-1680 to myself. Maybe Tachyon 0.7 will fix the issues.

On Tue, Aug 4, 2015 at 1:57 PM, Stephan Ewen se...@apache.org wrote:
> Yes. We should know, though, whether this is a Java 6 bug, or a bug in our system that just happens to occur only with Java 6 (because of different timings in this other engine).
Re: Failing Test again
I've also seen the BufferSpillerTest fail: https://travis-ci.org/apache/flink/jobs/74057503

On Tue, 4 Aug 2015 at 14:10 Robert Metzger rmetz...@apache.org wrote:
> I've assigned https://issues.apache.org/jira/browse/FLINK-1680 to myself. Maybe Tachyon 0.7 will fix the issues.
Re: Failing Test again
Rebased on: https://github.com/mjsax/flink/commit/fab61a1954ff1554448e826e1d273689ed520fc3

But if the gap between two rebases is large, it's hard to say what the problem might be...

The old parent commit (i.e., the rebase before the last rebase) was https://github.com/mjsax/flink/commit/148395bcd81a93bcb1473e4e93f267edb3b71c7e

-Matthias

On 08/04/2015 08:57 AM, Aljoscha Krettek wrote:
> What are the commits that you rebased on? Could you maybe narrow down what caused the regression?
>
> On Mon, 3 Aug 2015 at 23:31 Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
>> I only report failing tests after a rebase. ;)
>>
>> -Matthias
>>
>> On 08/03/2015 11:23 PM, Henry Saputra wrote:
>>> Thanks for reporting it, Matthias. I will try to run Travis for the latest Flink. The Tachyon test is a bit flaky. Maybe updating to the latest release could help.
>>>
>>> - Henry
Re: Failing Test again
Yes. We should know, though, whether this is a Java 6 bug, or a bug in our system that just happens to occur only with Java 6 (because of different timings in this other engine).

On Tue, Aug 4, 2015 at 12:27 PM, Chesnay Schepler chesnay.schep...@fu-berlin.de wrote:
> Aren't we dropping Java 6 support?
Re: Failing Test again
Yes, the build stability is a serious problem right now. Here are the problems in question, and what we could do about them:

BarrierBuffer:
The BarrierBuffer tests fail in Java 6 builds. I have not yet found a way to diagnose that problem, but if we cannot find the issue today, I would be willing to revert my latest commits on the barrier buffer to increase stability.

StreamCheckpointingITCase:
This seems to have started with either the barrier buffer or the updated partitioned state. If fixing/reverting the barrier buffer does not fix it, and no fix has come up by then, let's revert the latest changes to the partitioned state and re-add them when they are stable.

Tachyon:
The Tachyon mini cluster apparently has a problem: the programs exit with a sysexit or segfault. Since we have no Tachyon code ourselves, do we need this test as part of the nightly tests? Can we make this a manual test that we trigger on demand?

Greetings,
Stephan

On Tue, Aug 4, 2015 at 11:41 AM, Aljoscha Krettek aljos...@apache.org wrote:
> I've also seen this fail: https://travis-ci.org/apache/flink/jobs/74025862 in SuccessAfterNetworkBuffersFailureITCase
>
> Build seems quite flaky recently.
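One common way to turn a test into a "manual test that we trigger on demand", as asked above, is to gate it behind an opt-in switch that is off by default. The following is only a sketch of that idea, not how Flink handled the Tachyon test; the class name `OptInTestGate` and the property name `test.tachyon.enabled` are hypothetical.

```java
public class OptInTestGate {

    /** Returns true only if the given system property is set to "true" (case-insensitive). */
    public static boolean isEnabled(String propertyName) {
        return Boolean.parseBoolean(System.getProperty(propertyName, "false"));
    }

    public static void main(String[] args) {
        // Hypothetical property name; run with -Dtest.tachyon.enabled=true to opt in.
        String property = "test.tachyon.enabled";
        if (isEnabled(property)) {
            System.out.println("running Tachyon test");
        } else {
            System.out.println("skipping Tachyon test (set -D" + property + "=true to run)");
        }
    }
}
```

In a JUnit 4 test, the result of such a check could be passed to `org.junit.Assume.assumeTrue`, so that the test is reported as skipped rather than passed when the switch is off.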
Re: Failing Test again
Aren't we dropping Java 6 support?

On 04.08.2015 12:21, Stephan Ewen wrote:
> The StateCheckpointedITCase, which also tests these guarantees thoroughly, has not failed so far. But we need to first rule out the BarrierBuffer. The problem is that the bug occurs only on Java 6 and cannot be reproduced locally...
>
> On Tue, Aug 4, 2015 at 12:14 PM, Gyula Fóra gyula.f...@gmail.com wrote:
>> Honestly, I don't think the partitioned state changes have anything to do with the stability; only the reworked test case does, which now properly tests exactly-once guarantees, something that was missing before.
Re: Failing Test again
Honestly, I don't think the partitioned state changes have anything to do with the stability; only the reworked test case does, which now properly tests exactly-once guarantees, something that was missing before.

Stephan Ewen se...@apache.org wrote (on Tue, Aug 4, 2015, 12:12):
> Yes, the build stability is a serious problem right now. Here are the problems in question, and what we could do about them:
>
> BarrierBuffer:
> The BarrierBuffer tests fail in Java 6 builds. I have not yet found a way to diagnose that problem, but if we cannot find the issue today, I would be willing to revert my latest commits on the barrier buffer to increase stability.
>
> StreamCheckpointingITCase:
> This seems to have started with either the barrier buffer or the updated partitioned state. If fixing/reverting the barrier buffer does not fix it, and no fix has come up by then, let's revert the latest changes to the partitioned state and re-add them when they are stable.
>
> Tachyon:
> The Tachyon mini cluster apparently has a problem: the programs exit with a sysexit or segfault. Since we have no Tachyon code ourselves, do we need this test as part of the nightly tests? Can we make this a manual test that we trigger on demand?
>
> Greetings,
> Stephan
Re: Failing Test again
I've also seen this fail: https://travis-ci.org/apache/flink/jobs/74025862 in SuccessAfterNetworkBuffersFailureITCase

Build seems quite flaky recently.

On Tue, 4 Aug 2015 at 10:27 Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Rebased on: https://github.com/mjsax/flink/commit/fab61a1954ff1554448e826e1d273689ed520fc3
>
> But if the gap between two rebases is large, it's hard to say what the problem might be...
>
> The old parent commit (i.e., the rebase before the last rebase) was https://github.com/mjsax/flink/commit/148395bcd81a93bcb1473e4e93f267edb3b71c7e
>
> -Matthias
Failing Test
Hi,

I just hit a failing test (https://travis-ci.org/apache/flink/jobs/73899795). Is it known or new?

Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 86.929 sec FAILURE! - in org.apache.flink.test.checkpointing.StreamCheckpointingITCase
runCheckpointedProgram(org.apache.flink.test.checkpointing.StreamCheckpointingITCase) Time elapsed: 77.945 sec FAILURE!
java.lang.AssertionError: expected:25 but was:0
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:144)
    at org.apache.flink.test.checkpointing.StreamCheckpointingITCase.runCheckpointedProgram(StreamCheckpointingITCase.java:164)

-Matthias
Failing Test again
Today, not a single build completed successfully. Please see here:

Flink Streaming Core:
https://travis-ci.org/mjsax/flink/jobs/73938109
https://travis-ci.org/mjsax/flink/jobs/73951362
https://travis-ci.org/apache/flink/jobs/73938124
https://travis-ci.org/apache/flink/jobs/73899795
https://travis-ci.org/apache/flink/jobs/73938122
https://travis-ci.org/apache/flink/jobs/73952441

Flink Tachyon:
https://travis-ci.org/apache/flink/jobs/73938123

-Matthias
Re: Failing Test again
Thanks for reporting it, Matthias.

Will try to run Travis for latest Flink.

Tachyon test is a bit flaky. Maybe updating to latest release could help.

- Henry

On Mon, Aug 3, 2015 at 2:18 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Today, not a single build completed successfully. Please see here:
>
> Flink Streaming Core:
> https://travis-ci.org/mjsax/flink/jobs/73938109
> https://travis-ci.org/mjsax/flink/jobs/73951362
> https://travis-ci.org/apache/flink/jobs/73938124
> https://travis-ci.org/apache/flink/jobs/73899795
> https://travis-ci.org/apache/flink/jobs/73938122
> https://travis-ci.org/apache/flink/jobs/73952441
>
> Flink Tachyon:
> https://travis-ci.org/apache/flink/jobs/73938123
>
> -Matthias
Re: Failing Test again
I only report failing tests after a rebase. ;)

-Matthias

On 08/03/2015 11:23 PM, Henry Saputra wrote:
> Thanks for reporting it, Matthias.
>
> Will try to run Travis for latest Flink.
>
> Tachyon test is a bit flaky. Maybe updating to latest release could help.
>
> - Henry
>
> On Mon, Aug 3, 2015 at 2:18 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
>> Today, not a single build completed successfully. Please see here:
>>
>> Flink Streaming Core:
>> https://travis-ci.org/mjsax/flink/jobs/73938109
>> https://travis-ci.org/mjsax/flink/jobs/73951362
>> https://travis-ci.org/apache/flink/jobs/73938124
>> https://travis-ci.org/apache/flink/jobs/73899795
>> https://travis-ci.org/apache/flink/jobs/73938122
>> https://travis-ci.org/apache/flink/jobs/73952441
>>
>> Flink Tachyon:
>> https://travis-ci.org/apache/flink/jobs/73938123
>>
>> -Matthias
Re: Failing Test
Seen this a few times as well. Maybe something with the latest partitioned state changes...

On Mon, Aug 3, 2015 at 5:48 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:
> Hi,
>
> I just hit a failing test (https://travis-ci.org/apache/flink/jobs/73899795). Is it known or new?
>
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 86.929 sec FAILURE! - in org.apache.flink.test.checkpointing.StreamCheckpointingITCase
> runCheckpointedProgram(org.apache.flink.test.checkpointing.StreamCheckpointingITCase) Time elapsed: 77.945 sec FAILURE!
> java.lang.AssertionError: expected:25 but was:0
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotEquals(Assert.java:743)
>     at org.junit.Assert.assertEquals(Assert.java:118)
>     at org.junit.Assert.assertEquals(Assert.java:144)
>     at org.apache.flink.test.checkpointing.StreamCheckpointingITCase.runCheckpointedProgram(StreamCheckpointingITCase.java:164)
>
> -Matthias
Re: Failing Test
Thanks Matthias for overlooking the issue. Thank you Till for the problem formulation and the suggested steps for solving the synchronization problem. I will look into this as soon as possible. Cheers, Max On Fri, Jul 17, 2015 at 11:18 AM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I will open an JIRA for this. It's getting complicated. On 07/17/2015 11:04 AM, Till Rohrmann wrote: I think the problem might be related to the way the test is constructed. The test submits a job to the JM and then tries to poll the accumulators from the JM. If it does not succeed, then the polling is retried with an decreasing pause in between. Furthermore, the task which updates the accumulators also sleeps for the same period until it reads the next element and updates the accumulators. Since the test does not use an explicit synchronization but instead relies on sleeps, it will most likely exhibit a flakey behaviour. Sleeps don't work reliable enough, especially on Travis, to guarantee a certain thread interleaving. I'd recommend introducing explicit synchronization mechanism which control the behaviour of the accumulator producing task and explicit testing messages which indicate that a new accumulator value has arrived at the JM. Cheers, Till On Thu, Jul 16, 2015 at 11:04 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, the test still fails. This time in both runs (Flink Travis and my own Travis) -- only for Java 8 again: https://travis-ci.org/apache/flink/jobs/71314132 https://travis-ci.org/mjsax/flink/jobs/71179608 -Matthias On 07/16/2015 02:28 PM, Matthias J. Sax wrote: Great! I will. As 4 of 5 runs succeeded I cannot test explicitly. Will have an eye on it in future runs. -Matthias On 07/16/2015 02:24 PM, Maximilian Michels wrote: Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. 
My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are not result files involved (as fas as I can tell). -Matthias
Re: Failing Test
I think the problem might be related to the way the test is constructed. The test submits a job to the JM and then tries to poll the accumulators from the JM. If it does not succeed, then the polling is retried with an decreasing pause in between. Furthermore, the task which updates the accumulators also sleeps for the same period until it reads the next element and updates the accumulators. Since the test does not use an explicit synchronization but instead relies on sleeps, it will most likely exhibit a flakey behaviour. Sleeps don't work reliable enough, especially on Travis, to guarantee a certain thread interleaving. I'd recommend introducing explicit synchronization mechanism which control the behaviour of the accumulator producing task and explicit testing messages which indicate that a new accumulator value has arrived at the JM. Cheers, Till On Thu, Jul 16, 2015 at 11:04 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, the test still fails. This time in both runs (Flink Travis and my own Travis) -- only for Java 8 again: https://travis-ci.org/apache/flink/jobs/71314132 https://travis-ci.org/mjsax/flink/jobs/71179608 -Matthias On 07/16/2015 02:28 PM, Matthias J. Sax wrote: Great! I will. As 4 of 5 runs succeeded I cannot test explicitly. Will have an eye on it in future runs. -Matthias On 07/16/2015 02:24 PM, Maximilian Michels wrote: Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. 
Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are not result files involved (as fas as I can tell). -Matthias
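Till's recommendation in this message (explicit synchronization instead of sleeps) can be illustrated with a minimal Java sketch. It is not taken from the Flink code base: the class `LatchSyncSketch`, the method `observeAfterUpdate`, and the use of an `AtomicInteger` as a stand-in for an accumulator are assumptions made purely for illustration. The idea is that the producing task signals when a new value is available and waits until the checker has read it, so the interleaving no longer depends on timing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Flink code: shows latch-based hand-off between a
// value-producing task and the thread that checks the value.
public class LatchSyncSketch {

    /** Runs one producer/checker round and returns the value the checker observed. */
    public static int observeAfterUpdate(int newValue) {
        AtomicInteger accumulator = new AtomicInteger(0); // stand-in for an accumulator
        CountDownLatch updated = new CountDownLatch(1);   // counted down once the update is done
        CountDownLatch checked = new CountDownLatch(1);   // counted down once the checker has read

        Thread producer = new Thread(() -> {
            accumulator.set(newValue);
            updated.countDown();          // explicit "new value has arrived" signal
            try {
                checked.await();          // do not continue until the checker has verified it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        try {
            // Wait for the signal instead of sleeping and hoping for the right interleaving.
            if (!updated.await(10, TimeUnit.SECONDS)) {
                throw new AssertionError("producer never signalled an update");
            }
            int observed = accumulator.get();
            checked.countDown();          // let the producer proceed
            producer.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(observeAfterUpdate(25));
    }
}
```

The timeout on `await` only guards against a hung producer; it does not decide the outcome of the check, which is the difference from a sleep-based test.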
Re: Failing Test
I will open an JIRA for this. It's getting complicated. On 07/17/2015 11:04 AM, Till Rohrmann wrote: I think the problem might be related to the way the test is constructed. The test submits a job to the JM and then tries to poll the accumulators from the JM. If it does not succeed, then the polling is retried with an decreasing pause in between. Furthermore, the task which updates the accumulators also sleeps for the same period until it reads the next element and updates the accumulators. Since the test does not use an explicit synchronization but instead relies on sleeps, it will most likely exhibit a flakey behaviour. Sleeps don't work reliable enough, especially on Travis, to guarantee a certain thread interleaving. I'd recommend introducing explicit synchronization mechanism which control the behaviour of the accumulator producing task and explicit testing messages which indicate that a new accumulator value has arrived at the JM. Cheers, Till On Thu, Jul 16, 2015 at 11:04 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, the test still fails. This time in both runs (Flink Travis and my own Travis) -- only for Java 8 again: https://travis-ci.org/apache/flink/jobs/71314132 https://travis-ci.org/mjsax/flink/jobs/71179608 -Matthias On 07/16/2015 02:28 PM, Matthias J. Sax wrote: Great! I will. As 4 of 5 runs succeeded I cannot test explicitly. Will have an eye on it in future runs. -Matthias On 07/16/2015 02:24 PM, Maximilian Michels wrote: Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. 
Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are not result files involved (as fas as I can tell). -Matthias signature.asc Description: OpenPGP digital signature
Failing Test
Hi,

I hit another failing test (that is new to me):

Results :

Failed tests:
AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null

Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase
testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE!
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertTrue(Assert.java:52)
    at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189)
    at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68)

Please see: https://travis-ci.org/mjsax/flink/jobs/71179608

Does anyone know anything about it?

BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are no result files involved (as far as I can tell).

-Matthias
Re: Failing Test
Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are not result files involved (as fas as I can tell). -Matthias
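The retry mechanism Max describes (fail only if the assertion does not hold for a configured number of attempts) could look roughly like the following Java sketch. This is an illustration under assumptions, not the code that was committed to Flink: the class `RetryAssertSketch`, the method `assertEventually`, and its parameters `maxAttempts` and `pauseMillis` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Flink code: re-evaluates a condition a bounded
// number of times and fails only if it never holds.
public class RetryAssertSketch {

    /**
     * Checks the condition up to maxAttempts times, pausing between attempts.
     * Returns the 1-based attempt on which the condition first held.
     */
    public static int assertEventually(BooleanSupplier condition, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis); // give the reporting side time to catch up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", e);
                }
            }
        }
        throw new AssertionError("condition did not hold after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The condition becomes true on the third evaluation.
        int attempt = assertEventually(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println("held on attempt " + attempt);
    }
}
```

A retry loop like this makes a strict assertion more tolerant of delayed reports, but it still relies on pauses between attempts, so it can remain timing-sensitive on a slow CI machine.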
Re: Failing Test
Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are not result files involved (as fas as I can tell). -Matthias
Re: Failing Test
Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? 
BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are no result files involved (as far as I can tell). -Matthias
Re: Failing Test
Great! I will. As 4 of 5 runs succeeded I cannot test explicitly. Will have an eye on it in future runs. -Matthias On 07/16/2015 02:24 PM, Maximilian Michels wrote: Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! 
java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are no result files involved (as far as I can tell). -Matthias
Re: Failing Test
Hi, the test still fails. This time in both runs (Flink Travis and my own Travis) -- only for Java 8 again: https://travis-ci.org/apache/flink/jobs/71314132 https://travis-ci.org/mjsax/flink/jobs/71179608 -Matthias On 07/16/2015 02:28 PM, Matthias J. Sax wrote: Great! I will. As 4 of 5 runs succeeded I cannot test explicitly. Will have an eye on it in future runs. -Matthias On 07/16/2015 02:24 PM, Maximilian Michels wrote: Hi Matthias, I've pushed a fix to the master. The problem should be solved. Please tell me if your Travis reports an error again. My Travis never complained :) Cheers, Max On Thu, Jul 16, 2015 at 12:00 PM, Maximilian Michels m...@apache.org wrote: Hi Matthias, This is indeed a timing issue when checking for the results in this test. The new accumulator implementation now continuously reports from the running tasks to the job manager. This was merged yesterday. The assertion that fails there is a bit strict. Actually, I've already integrated a retry mechanism that fails only if the assertions don't hold for a configured number of times. I'll commit a fix to the master. Thanks for reporting! Cheers, Max On Thu, Jul 16, 2015 at 11:33 AM, Ufuk Celebi u...@apache.org wrote: Hey, this has been merged yesterday. I guess it's a timing issue when verifying the results. Can you file an issue for this? – Ufuk On 16 Jul 2015, at 11:30, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, I hit another failing test (that is new to me): Results : Failed tests: AccumulatorLiveITCase.testProgram:106-access$1100:68-checkFlinkAccumulators:189 null Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.694 sec FAILURE! - in org.apache.flink.test.accumulators.AccumulatorLiveITCase testProgram(org.apache.flink.test.accumulators.AccumulatorLiveITCase) Time elapsed: 8.021 sec FAILURE! 
java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.checkFlinkAccumulators(AccumulatorLiveITCase.java:189) at org.apache.flink.test.accumulators.AccumulatorLiveITCase.access$1100(AccumulatorLiveITCase.java:68) Please see: https://travis-ci.org/mjsax/flink/jobs/71179608 Does anyone know anything about it? BTW: Even if this test is in flink-tests, the problem seems not to be related to https://issues.apache.org/jira/browse/FLINK-2032 because accumulators are tested. There are no result files involved (as far as I can tell). -Matthias
[jira] [Created] (FLINK-2349) Instable (failing) Test
Matthias J. Sax created FLINK-2349:
--
Summary: Instable (failing) Test
Key: FLINK-2349
URL: https://issues.apache.org/jira/browse/FLINK-2349
Project: Flink
Issue Type: Bug
Components: Tests
Reporter: Matthias J. Sax

Instable Test fails regularly:
- https://travis-ci.org/apache/flink/builds/70397048
- https://travis-ci.org/mjsax/flink/jobs/70432777
- https://travis-ci.org/mjsax/flink/jobs/70432616
- https://travis-ci.org/mjsax/flink/jobs/70386808

Failed tests:
ProcessFailureStreamingRecoveryITCaseAbstractProcessFailureRecoveryTest.testTaskManagerProcessFailure:198 The program encountered a ProgramInvocationException : The program execution failed: Job execution failed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)