@Haisheng: since you can reproduce it every time locally can you share the stacktrace at the moment that it is stucked?
I think it will help a lot in finding the problem. On Fri, May 31, 2019, 8:22 AM Ruben Q L <[email protected]> wrote: > Ok, I understand. I'll see if I can reproduce it on my side. > Thanks > > > Le ven. 31 mai 2019 à 08:19, Haisheng Yuan <[email protected]> a > écrit : > >> just run that single test would not reproduce. >> It also passes on my machine if run it separately. mvn clean install can >> repro every time. >> >> >> >> >> >> Thanks~ >> Haisheng >> Yuan------------------------------------------------------------------ >> 发件人:Ruben Q L<[email protected]> >> 日 期:2019年05月31日 14:03:44 >> 收件人:<[email protected]> >> 抄 送:Stamatis Zampetakis<[email protected]> >> 主 题:Re: Re: CyclicMetadataException in >> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase >> >> Thanks for the info. >> I'm running EnumerableRepeatUnionHierarchyTest, and in my local >> environment >> (macOS 10.14.5, JDK 8) it ends successfully. >> I'll continue digging into it... >> >> >> Le ven. 31 mai 2019 à 04:20, Haisheng Yuan <[email protected]> a >> écrit : >> >> > The test suite stuck on my laptop for ever. macOS 10.14.4, JDK 8. >> > Ignoring test case EnumerableRepeatUnionHierarchyTest makes the whole >> test >> > pass again. >> > >> > Thanks ~ >> > Haisheng Yuan >> > ------------------------------------------------------------------ >> > 发件人:Stamatis Zampetakis<[email protected]> >> > 日 期:2019年05月30日 22:49:22 >> > 收件人:<[email protected]> >> > 主 题:Re: CyclicMetadataException in >> > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase >> > >> > Thanks for noticing Laurent! >> > >> > The last builds were successful so I guess we have to do with another >> > intermittent failure. >> > >> > If the problem persists, I will create a JIRA and follow up there 😉 >> > >> > On Thu, May 30, 2019, 6:43 AM Laurent Goujon <[email protected]> >> wrote: >> > >> > > Looks like Jenkins started to timeout with the following change: >> > > >> > > >> > >> https://github.com/apache/calcite/commit/73e6d05fa65f16485caca80571d1fe4fda5c7468 >> > > . >> > > >> > > Ruben, Stamatis: any idea? >> > > >> > > On Wed, May 29, 2019 at 8:22 PM Laurent Goujon <[email protected]> >> > wrote: >> > > >> > > > It looks like Jenkins build with JDK8 now takes more than 40min and >> is >> > > > being aborted, with the core module taking something like 38min and >> > being >> > > > stopped during tests. It seems to be independent of my test change >> > since >> > > > RelOptUtilTest shows to be running successfully in 1second. >> > > > That said, I don't see a test still be running from the logs, so I >> > guess >> > > > it would be something keeping the jvm from stopping even if all >> tests >> > are >> > > > completed? >> > > > >> > > > On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis < >> [email protected] >> > > >> > > > wrote: >> > > > >> > > >> Makes sense, I was expecting something like that. >> > > >> >> > > >> +1 for the PR. Even if we are not 100% that will resolve the >> problem, >> > > >> getting rid of the statics seems like a good idea and beneficial in >> > any >> > > >> case. >> > > >> >> > > >> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon < >> [email protected]> >> > > >> wrote: >> > > >> >> > > >> > Unfortunately, not locally, so I ended in crude debugging and >> code >> > > >> analysis >> > > >> > to understand where the static reloptcluster created for the >> static >> > > >> > relbuilder was used in a context which would cause issue in a >> > > >> multithreaded >> > > >> > context. Once I saw several rel nodes getting the parent cluster >> > (the >> > > >> > static table scan) to access the metadata query (which cannot be >> > > reused >> > > >> > between threads), that's where I assumed this would be the >> problem. >> > > >> > >> > > >> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis < >> > > [email protected]> >> > > >> > wrote: >> > > >> > >> > > >> > > Thanks a lot Laurent, much appreciated! >> > > >> > > >> > > >> > > BTW, were you able to reproduce the problem? >> > > >> > > >> > > >> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon < >> > [email protected] >> > > > >> > > >> > > wrote: >> > > >> > > >> > > >> > > > Here's the pull request: >> > > >> https://github.com/apache/calcite/pull/1240 >> > > >> > > > >> > > >> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon < >> > > [email protected]> >> > > >> > > wrote: >> > > >> > > > >> > > >> > > > > Looks like most {{RelNode#create()}} access the >> > > >> > > RelOptCluster/RelBuilder >> > > >> > > > > instance from their child, and some then, perform metadata >> > > >> operation, >> > > >> > > > which >> > > >> > > > > would cause CyclicMetadataException for example. I'll >> create a >> > > >> fixup >> > > >> > > > patch. >> > > >> > > > > >> > > >> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon < >> > > >> [email protected]> >> > > >> > > > wrote: >> > > >> > > > > >> > > >> > > > >> I actually did the change from using a static relbuilder >> to a >> > > >> mix of >> > > >> > > > >> static (to create the scans) and per test because of >> > > concurrency >> > > >> > > issues >> > > >> > > > :( >> > > >> > > > >> Maybe the test should only been using a per test-case >> > > relbuilder >> > > >> > > > instead, >> > > >> > > > >> but it would mean that there's some code relying on the >> > static >> > > >> > > > relbuilder >> > > >> > > > >> (by accessing it from the scan rel nodes) in a non-static >> > > >> context. >> > > >> > > I'll >> > > >> > > > try >> > > >> > > > >> to debug it too (but so far was unlucky reproducing the >> > jenkins >> > > >> > > > problem). >> > > >> > > > >> >> > > >> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang < >> > > >> > > > [email protected]> >> > > >> > > > >> wrote: >> > > >> > > > >> >> > > >> > > > >>> In Go, there's a built in race detector that can be used >> > when >> > > >> > running >> > > >> > > > >>> tests using `go test`. >> > > >> > > > >>> >> > > >> > > > >>> There's RacerD [1] from Facebook that can detect races in >> > > Java. >> > > >> > > Perhaps >> > > >> > > > >>> this is something that can be looked in to to find the >> race. >> > > >> > > > >>> >> > > >> > > > >>> [1] https://fbinfer.com/docs/racerd.html >> > > >> > > > >>> >> > > >> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote: >> > > >> > > > >>> > Good insights, Ruben, Danny! >> > > >> > > > >>> > >> > > >> > > > >>> > Assuming that is indeed a concurrency problem, it will >> be >> > > >> > difficult >> > > >> > > > to >> > > >> > > > >>> > identify since many parts in Calcite are not thread >> safe. >> > > Not >> > > >> > being >> > > >> > > > >>> able to >> > > >> > > > >>> > reproduce the problem makes the things even worse. >> > > >> > > > >>> > >> > > >> > > > >>> > Given that intermittent test failures occur often on >> > > Jenkins, >> > > >> how >> > > >> > > > about >> > > >> > > > >>> > creating a new branch dedicated to debugging? >> > > >> > > > >>> > It can be associated with a new Jenkins jobs (that >> > obviously >> > > >> > > doesn't >> > > >> > > > >>> send >> > > >> > > > >>> > mails to everybody when there are failures) that can be >> > > >> launched >> > > >> > > > >>> on-demand >> > > >> > > > >>> > by the person who is looking into the problem. >> > > >> > > > >>> > >> > > >> > > > >>> > On the other hand, we do not really make an effort to >> have >> > > >> tests >> > > >> > > that >> > > >> > > > >>> can >> > > >> > > > >>> > be executed concurrently so another alternative would >> be >> > to >> > > >> run >> > > >> > > tests >> > > >> > > > >>> only >> > > >> > > > >>> > sequentially. >> > > >> > > > >>> > I am not very fan of this approach since it will rather >> > hide >> > > >> > > problems >> > > >> > > > >>> than >> > > >> > > > >>> > solve them. >> > > >> > > > >>> > >> > > >> > > > >>> > >> > > >> > > > >>> > >> > > >> > > > >>> > >> > > >> > > > >>> > >> > > >> > > > >>> > >> > > >> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen < >> > > >> > [email protected]> >> > > >> > > > >>> wrote: >> > > >> > > > >>> > >> > > >> > > > >>> >> Thanks Ruben for your good analysis. >> > > >> > > > >>> >> >> > > >> > > > >>> >> What I’m confused is that isn’t the static REL_BUILDER >> > more >> > > >> > prone >> > > >> > > to >> > > >> > > > >>> have >> > > >> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN >> and >> > > >> > > DEPT_SCAN) >> > > >> > > > >>> are all >> > > >> > > > >>> >> nodes(immutable), how could this be a problem ? >> > > >> > > > >>> >> >> > > >> > > > >>> >> Best, >> > > >> > > > >>> >> Danny Chan >> > > >> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L < >> [email protected] >> > > >,写道: >> > > >> > > > >>> >>> I'm checking the commit [1] and I see something >> strange >> > in >> > > >> > > > >>> >> RelOptUtilTest. >> > > >> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case >> it >> > may >> > > >> > help: >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> With the latest modification, it seems that we have >> two >> > > >> > > > >>> RelBuilder(s) in >> > > >> > > > >>> >>> place: >> > > >> > > > >>> >>> - A static one that is created ad-hoc on a static >> block >> > to >> > > >> > > generate >> > > >> > > > >>> the >> > > >> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2] >> > > >> > > > >>> >>> - An instance one to be used in the tests, that is >> > > >> initialized >> > > >> > on >> > > >> > > > >>> >>> the @Before public void setUp() method [3] >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were >> only >> > > used >> > > >> to >> > > >> > > read >> > > >> > > > >>> their >> > > >> > > > >>> >>> rowTypes to test some join auxiliary methods. But the >> > new >> > > >> > > > >>> >>> tests testPushDownJoinConditions* actually build a >> plan >> > > and >> > > >> > push >> > > >> > > > >>> these >> > > >> > > > >>> >>> scans into the RelBuilder to be tested [4] (which is >> a >> > > >> > different >> > > >> > > > one >> > > >> > > > >>> than >> > > >> > > > >>> >>> the static RelBuider that created the scans). >> > > >> > > > >>> >>> Maybe this is no problem generally, but it can >> > potentially >> > > >> be >> > > >> > > under >> > > >> > > > >>> >> certain >> > > >> > > > >>> >>> circumstances?, which would explain the randomness of >> > the >> > > >> > issue. >> > > >> > > > >>> >>> Could this explain the exception? >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> [1] >> > > >> > > > >>> >>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> [2] >> > > >> > > > >>> >>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71 >> > > >> > > > >>> >>> [3] >> > > >> > > > >>> >>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92 >> > > >> > > > >>> >>> [4] >> > > >> > > > >>> >>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292 >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> >> > > >> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde < >> > > [email protected]> >> > > >> a >> > > >> > > > écrit >> > > >> > > > >>> : >> > > >> > > > >>> >>> >> > > >> > > > >>> >>>> It’s a tough call. It is probable that the problem >> > > existed >> > > >> > > already >> > > >> > > > >>> and >> > > >> > > > >>> >> the >> > > >> > > > >>> >>>> change merely surfaced it. >> > > >> > > > >>> >>>> >> > > >> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis < >> > > >> > > > >>> [email protected]> >> > > >> > > > >>> >>>> wrote: >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>> It is not the only test that is failing after >> commit >> > [1] >> > > >> but >> > > >> > > all >> > > >> > > > >>> the >> > > >> > > > >>> >> new >> > > >> > > > >>> >>>>> tests that were added. >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I >> > > cannot >> > > >> > > > reproduce >> > > >> > > > >>> >> it >> > > >> > > > >>> >>>>> locally. >> > > >> > > > >>> >>>>> I guess we have to do with a race condition most >> > likely >> > > >> due >> > > >> > to >> > > >> > > > the >> > > >> > > > >>> >>>>> concurrent execution of tests with surefire. >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>> Should we revert the commit till we find a >> solution? >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>> [1] >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58 >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde < >> > > >> > [email protected]> >> > > >> > > > >>> >> wrote: >> > > >> > > > >>> >>>>> >> > > >> > > > >>> >>>>>> I have seen this intermittent failure 3 times in >> the >> > > last >> > > >> > > week: >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> [INFO] Running >> org.apache.calcite.plan.RelOptUtilTest >> > > >> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, >> > Skipped: >> > > >> 0, >> > > >> > > Time >> > > >> > > > >>> >> elapsed: >> > > >> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in >> > > >> > > org.apache.calcite.plan.RelOptUtilTest >> > > >> > > > >>> >>>>>> [ERROR] >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest) >> > > >> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR! >> > > >> > > > >>> >>>>>> >> > org.apache.calcite.rel.metadata.CyclicMetadataException >> > > >> > > > >>> >>>>>> at >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>> >> > > >> > > > >>> >> >> > > >> > > > >>> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445) >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. >> The >> > > test >> > > >> was >> > > >> > > > only >> > > >> > > > >>> >> added >> > > >> > > > >>> >>>> on >> > > >> > > > >>> >>>>>> May 22 so I assume that it will continue to fail >> > > >> > > intermittently >> > > >> > > > >>> >> until >> > > >> > > > >>> >>>> we do >> > > >> > > > >>> >>>>>> something. >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> Anyone have any ideas? >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> Laurent, As you added the test can you please look >> > into >> > > >> it? >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> Julian >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>>>> >> > > >> > > > >>> >>>> >> > > >> > > > >>> >>>> >> > > >> > > > >>> >> >> > > >> > > > >>> > >> > > >> > > > >>> >> > > >> > > > >> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > > >> > > >> > >> > >> >>
