I actually did the change from using a static relbuilder to a mix of static (to create the scans) and per test because of concurrency issues :( Maybe the test should only been using a per test-case relbuilder instead, but it would mean that there's some code relying on the static relbuilder (by accessing it from the scan rel nodes) in a non-static context. I'll try to debug it too (but so far was unlucky reproducing the jenkins problem).
On Wed, May 29, 2019 at 5:02 AM Francis Chuang <[email protected]> wrote: > In Go, there's a built in race detector that can be used when running > tests using `go test`. > > There's RacerD [1] from Facebook that can detect races in Java. Perhaps > this is something that can be looked in to to find the race. > > [1] https://fbinfer.com/docs/racerd.html > > On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: > > Good insights, Ruben, Danny! > > > > Assuming that is indeed a concurrency problem, it will be difficult to > > identify since many parts in Calcite are not thread safe. Not being able > to > > reproduce the problem makes the things even worse. > > > > Given that intermittent test failures occur often on Jenkins, how about > > creating a new branch dedicated to debugging? > > It can be associated with a new Jenkins jobs (that obviously doesn't send > > mails to everybody when there are failures) that can be launched > on-demand > > by the person who is looking into the problem. > > > > On the other hand, we do not really make an effort to have tests that can > > be executed concurrently so another alternative would be to run tests > only > > sequentially. > > I am not very fan of this approach since it will rather hide problems > than > > solve them. > > > > > > > > > > > > > > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <[email protected]> > wrote: > > > >> Thanks Ruben for your good analysis. > >> > >> What I’m confused is that isn’t the static REL_BUILDER more prone to > have > >> concurrency problems ? And the pushed scans(EMP_SCAN and DEPT_SCAN) are > all > >> nodes(immutable), how could this be a problem ? > >> > >> Best, > >> Danny Chan > >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <[email protected]>,写道: > >>> I'm checking the commit [1] and I see something strange in > >> RelOptUtilTest. > >>> Maybe I'm wrong and it is nothing, but just in case it may help: > >>> > >>> With the latest modification, it seems that we have two RelBuilder(s) > in > >>> place: > >>> - A static one that is created ad-hoc on a static block to generate the > >>> EMP_SCAN and DEPT_SCAN RelNodes [2] > >>> - An instance one to be used in the tests, that is initialized on > >>> the @Before public void setUp() method [3] > >>> > >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read > their > >>> rowTypes to test some join auxiliary methods. But the new > >>> tests testPushDownJoinConditions* actually build a plan and push these > >>> scans into the RelBuilder to be tested [4] (which is a different one > than > >>> the static RelBuider that created the scans). > >>> Maybe this is no problem generally, but it can potentially be under > >> certain > >>> circumstances?, which would explain the randomness of the issue. > >>> Could this explain the exception? > >>> > >>> [1] > >>> > >> > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > >>> > >>> [2] > >>> > >> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 > >>> [3] > >>> > >> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 > >>> [4] > >>> > >> > https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 > >>> > >>> > >>> > >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <[email protected]> a écrit : > >>> > >>>> It’s a tough call. It is probable that the problem existed already and > >> the > >>>> change merely surfaced it. > >>>> > >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <[email protected]> > >>>> wrote: > >>>>> > >>>>> It is not the only test that is failing after commit [1] but all the > >> new > >>>>> tests that were added. > >>>>> > >>>>> I've seen the problem on Jenkins on all JDKS but I cannot reproduce > >> it > >>>>> locally. > >>>>> I guess we have to do with a race condition most likely due to the > >>>>> concurrent execution of tests with surefire. > >>>>> > >>>>> Should we revert the commit till we find a solution? > >>>>> > >>>>> [1] > >>>>> > >>>> > >> > https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 > >>>>> > >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <[email protected]> > >> wrote: > >>>>> > >>>>>> I have seen this intermittent failure 3 times in the last week: > >>>>>> > >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest > >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time > >> elapsed: > >>>>>> 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest > >>>>>> [ERROR] > >>>>>> > >>>> > >> > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) > >>>>>> Time elapsed: 0.349 s <<< ERROR! > >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException > >>>>>> at > >>>>>> > >>>> > >> > org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) > >>>>>> > >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was only > >> added > >>>> on > >>>>>> May 22 so I assume that it will continue to fail intermittently > >> until > >>>> we do > >>>>>> something. > >>>>>> > >>>>>> Anyone have any ideas? > >>>>>> > >>>>>> Laurent, As you added the test can you please look into it? > >>>>>> > >>>>>> Julian > >>>>>> > >>>>>> > >>>> > >>>> > >> > > >
