@Haisheng: since you can reproduce it every time locally can you share the
stacktrace at the moment that it is stucked?

I think it will help a lot in finding the problem.

On Fri, May 31, 2019, 8:22 AM Ruben Q L <[email protected]> wrote:

> Ok, I understand. I'll see if I can reproduce it on my side.
> Thanks
>
>
> Le ven. 31 mai 2019 à 08:19, Haisheng Yuan <[email protected]> a
> écrit :
>
>> just run that single test would not reproduce.
>> It also passes on my machine if run it separately. mvn clean install can
>> repro every time.
>>
>>
>>
>>
>>
>> Thanks~
>> Haisheng
>> Yuan------------------------------------------------------------------
>> 发件人:Ruben Q L<[email protected]>
>> 日 期:2019年05月31日 14:03:44
>> 收件人:<[email protected]>
>> 抄 送:Stamatis Zampetakis<[email protected]>
>> 主 题:Re: Re: CyclicMetadataException in
>> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase
>>
>> Thanks for the info.
>> I'm running EnumerableRepeatUnionHierarchyTest, and in my local
>> environment
>> (macOS 10.14.5, JDK 8) it ends successfully.
>> I'll continue digging into it...
>>
>>
>> Le ven. 31 mai 2019 à 04:20, Haisheng Yuan <[email protected]> a
>> écrit :
>>
>> > The test suite stuck on my laptop for ever. macOS 10.14.4, JDK 8.
>> > Ignoring test case EnumerableRepeatUnionHierarchyTest makes the whole
>> test
>> > pass again.
>> >
>> > Thanks ~
>> > Haisheng Yuan
>> > ------------------------------------------------------------------
>> > 发件人:Stamatis Zampetakis<[email protected]>
>> > 日 期:2019年05月30日 22:49:22
>> > 收件人:<[email protected]>
>> > 主 题:Re: CyclicMetadataException in
>> > testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase
>> >
>> > Thanks for noticing Laurent!
>> >
>> > The last builds were successful so I guess we have to do with another
>> > intermittent failure.
>> >
>> > If the problem persists, I will create a JIRA and follow up there 😉
>> >
>> > On Thu, May 30, 2019, 6:43 AM Laurent Goujon <[email protected]>
>> wrote:
>> >
>> > > Looks like Jenkins started to timeout with the following change:
>> > >
>> > >
>> >
>> https://github.com/apache/calcite/commit/73e6d05fa65f16485caca80571d1fe4fda5c7468
>> > > .
>> > >
>> > > Ruben, Stamatis: any idea?
>> > >
>> > > On Wed, May 29, 2019 at 8:22 PM Laurent Goujon <[email protected]>
>> > wrote:
>> > >
>> > > > It looks like Jenkins build with JDK8 now takes more than 40min and
>> is
>> > > > being aborted, with the core module taking something like 38min and
>> > being
>> > > > stopped during tests. It seems to be independent of my test change
>> > since
>> > > > RelOptUtilTest shows to be running successfully in 1second.
>> > > > That said, I don't see a test still be running from the logs, so I
>> > guess
>> > > > it would be something keeping the jvm from stopping even if all
>> tests
>> > are
>> > > > completed?
>> > > >
>> > > > On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis <
>> [email protected]
>> > >
>> > > > wrote:
>> > > >
>> > > >> Makes sense, I was expecting something like that.
>> > > >>
>> > > >> +1 for the PR. Even if we are not 100% that will resolve the
>> problem,
>> > > >> getting rid of the statics seems like a good idea and beneficial in
>> > any
>> > > >> case.
>> > > >>
>> > > >> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon <
>> [email protected]>
>> > > >> wrote:
>> > > >>
>> > > >> > Unfortunately, not locally, so I ended in crude debugging and
>> code
>> > > >> analysis
>> > > >> > to understand where the static reloptcluster created for the
>> static
>> > > >> > relbuilder was used in a context which would cause issue in a
>> > > >> multithreaded
>> > > >> > context. Once I saw several rel nodes getting the parent cluster
>> > (the
>> > > >> > static table scan) to access the metadata query (which cannot be
>> > > reused
>> > > >> > between threads), that's where I assumed this would be the
>> problem.
>> > > >> >
>> > > >> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis <
>> > > [email protected]>
>> > > >> > wrote:
>> > > >> >
>> > > >> > > Thanks a lot Laurent, much appreciated!
>> > > >> > >
>> > > >> > > BTW, were you able to reproduce the problem?
>> > > >> > >
>> > > >> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon <
>> > [email protected]
>> > > >
>> > > >> > > wrote:
>> > > >> > >
>> > > >> > > > Here's the pull request:
>> > > >> https://github.com/apache/calcite/pull/1240
>> > > >> > > >
>> > > >> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon <
>> > > [email protected]>
>> > > >> > > wrote:
>> > > >> > > >
>> > > >> > > > > Looks like most {{RelNode#create()}} access the
>> > > >> > > RelOptCluster/RelBuilder
>> > > >> > > > > instance from their child, and some then, perform metadata
>> > > >> operation,
>> > > >> > > > which
>> > > >> > > > > would cause CyclicMetadataException for example. I'll
>> create a
>> > > >> fixup
>> > > >> > > > patch.
>> > > >> > > > >
>> > > >> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <
>> > > >> [email protected]>
>> > > >> > > > wrote:
>> > > >> > > > >
>> > > >> > > > >> I actually did the change from using a static relbuilder
>> to a
>> > > >> mix of
>> > > >> > > > >> static (to create the scans) and per test because of
>> > > concurrency
>> > > >> > > issues
>> > > >> > > > :(
>> > > >> > > > >> Maybe the test should only been using a per test-case
>> > > relbuilder
>> > > >> > > > instead,
>> > > >> > > > >> but it would mean that there's some code relying on the
>> > static
>> > > >> > > > relbuilder
>> > > >> > > > >> (by accessing it from the scan rel nodes) in a non-static
>> > > >> context.
>> > > >> > > I'll
>> > > >> > > > try
>> > > >> > > > >> to debug it too (but so far was unlucky reproducing the
>> > jenkins
>> > > >> > > > problem).
>> > > >> > > > >>
>> > > >> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang <
>> > > >> > > > [email protected]>
>> > > >> > > > >> wrote:
>> > > >> > > > >>
>> > > >> > > > >>> In Go, there's a built in race detector that can be used
>> > when
>> > > >> > running
>> > > >> > > > >>> tests using `go test`.
>> > > >> > > > >>>
>> > > >> > > > >>> There's RacerD [1] from Facebook that can detect races in
>> > > Java.
>> > > >> > > Perhaps
>> > > >> > > > >>> this is something that can be looked in to to find the
>> race.
>> > > >> > > > >>>
>> > > >> > > > >>> [1] https://fbinfer.com/docs/racerd.html
>> > > >> > > > >>>
>> > > >> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
>> > > >> > > > >>> > Good insights, Ruben, Danny!
>> > > >> > > > >>> >
>> > > >> > > > >>> > Assuming that is indeed a concurrency problem, it will
>> be
>> > > >> > difficult
>> > > >> > > > to
>> > > >> > > > >>> > identify since many parts in Calcite are not thread
>> safe.
>> > > Not
>> > > >> > being
>> > > >> > > > >>> able to
>> > > >> > > > >>> > reproduce the problem makes the things even worse.
>> > > >> > > > >>> >
>> > > >> > > > >>> > Given that intermittent test failures occur often on
>> > > Jenkins,
>> > > >> how
>> > > >> > > > about
>> > > >> > > > >>> > creating a new branch dedicated to debugging?
>> > > >> > > > >>> > It can be associated with a new Jenkins jobs (that
>> > obviously
>> > > >> > > doesn't
>> > > >> > > > >>> send
>> > > >> > > > >>> > mails to everybody when there are failures) that can be
>> > > >> launched
>> > > >> > > > >>> on-demand
>> > > >> > > > >>> > by the person who is looking into the problem.
>> > > >> > > > >>> >
>> > > >> > > > >>> > On the other hand, we do not really make an effort to
>> have
>> > > >> tests
>> > > >> > > that
>> > > >> > > > >>> can
>> > > >> > > > >>> > be executed concurrently so another alternative would
>> be
>> > to
>> > > >> run
>> > > >> > > tests
>> > > >> > > > >>> only
>> > > >> > > > >>> > sequentially.
>> > > >> > > > >>> > I am not very fan of this approach since it will rather
>> > hide
>> > > >> > > problems
>> > > >> > > > >>> than
>> > > >> > > > >>> > solve them.
>> > > >> > > > >>> >
>> > > >> > > > >>> >
>> > > >> > > > >>> >
>> > > >> > > > >>> >
>> > > >> > > > >>> >
>> > > >> > > > >>> >
>> > > >> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <
>> > > >> > [email protected]>
>> > > >> > > > >>> wrote:
>> > > >> > > > >>> >
>> > > >> > > > >>> >> Thanks Ruben for your good analysis.
>> > > >> > > > >>> >>
>> > > >> > > > >>> >> What I’m confused is that isn’t the static REL_BUILDER
>> > more
>> > > >> > prone
>> > > >> > > to
>> > > >> > > > >>> have
>> > > >> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN
>> and
>> > > >> > > DEPT_SCAN)
>> > > >> > > > >>> are all
>> > > >> > > > >>> >> nodes(immutable), how could this be a problem ?
>> > > >> > > > >>> >>
>> > > >> > > > >>> >> Best,
>> > > >> > > > >>> >> Danny Chan
>> > > >> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <
>> [email protected]
>> > > >,写道:
>> > > >> > > > >>> >>> I'm checking the commit [1] and I see something
>> strange
>> > in
>> > > >> > > > >>> >> RelOptUtilTest.
>> > > >> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case
>> it
>> > may
>> > > >> > help:
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>> With the latest modification, it seems that we have
>> two
>> > > >> > > > >>> RelBuilder(s) in
>> > > >> > > > >>> >>> place:
>> > > >> > > > >>> >>> - A static one that is created ad-hoc on a static
>> block
>> > to
>> > > >> > > generate
>> > > >> > > > >>> the
>> > > >> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
>> > > >> > > > >>> >>> - An instance one to be used in the tests, that is
>> > > >> initialized
>> > > >> > on
>> > > >> > > > >>> >>> the @Before public void setUp() method [3]
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were
>> only
>> > > used
>> > > >> to
>> > > >> > > read
>> > > >> > > > >>> their
>> > > >> > > > >>> >>> rowTypes to test some join auxiliary methods. But the
>> > new
>> > > >> > > > >>> >>> tests testPushDownJoinConditions* actually build a
>> plan
>> > > and
>> > > >> > push
>> > > >> > > > >>> these
>> > > >> > > > >>> >>> scans into the RelBuilder to be tested [4] (which is
>> a
>> > > >> > different
>> > > >> > > > one
>> > > >> > > > >>> than
>> > > >> > > > >>> >>> the static RelBuider that created the scans).
>> > > >> > > > >>> >>> Maybe this is no problem generally, but it can
>> > potentially
>> > > >> be
>> > > >> > > under
>> > > >> > > > >>> >> certain
>> > > >> > > > >>> >>> circumstances?, which would explain the randomness of
>> > the
>> > > >> > issue.
>> > > >> > > > >>> >>> Could this explain the exception?
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>> [1]
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>> [2]
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
>> > > >> > > > >>> >>> [3]
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
>> > > >> > > > >>> >>> [4]
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <
>> > > [email protected]>
>> > > >> a
>> > > >> > > > écrit
>> > > >> > > > >>> :
>> > > >> > > > >>> >>>
>> > > >> > > > >>> >>>> It’s a tough call. It is probable that the problem
>> > > existed
>> > > >> > > already
>> > > >> > > > >>> and
>> > > >> > > > >>> >> the
>> > > >> > > > >>> >>>> change merely surfaced it.
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <
>> > > >> > > > >>> [email protected]>
>> > > >> > > > >>> >>>> wrote:
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>> It is not the only test that is failing after
>> commit
>> > [1]
>> > > >> but
>> > > >> > > all
>> > > >> > > > >>> the
>> > > >> > > > >>> >> new
>> > > >> > > > >>> >>>>> tests that were added.
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I
>> > > cannot
>> > > >> > > > reproduce
>> > > >> > > > >>> >> it
>> > > >> > > > >>> >>>>> locally.
>> > > >> > > > >>> >>>>> I guess we have to do with a race condition most
>> > likely
>> > > >> due
>> > > >> > to
>> > > >> > > > the
>> > > >> > > > >>> >>>>> concurrent execution of tests with surefire.
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>> Should we revert the commit till we find a
>> solution?
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>> [1]
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <
>> > > >> > [email protected]>
>> > > >> > > > >>> >> wrote:
>> > > >> > > > >>> >>>>>
>> > > >> > > > >>> >>>>>> I have seen this intermittent failure 3 times in
>> the
>> > > last
>> > > >> > > week:
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>> [INFO] Running
>> org.apache.calcite.plan.RelOptUtilTest
>> > > >> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1,
>> > Skipped:
>> > > >> 0,
>> > > >> > > Time
>> > > >> > > > >>> >> elapsed:
>> > > >> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in
>> > > >> > > org.apache.calcite.plan.RelOptUtilTest
>> > > >> > > > >>> >>>>>> [ERROR]
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
>> > > >> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR!
>> > > >> > > > >>> >>>>>>
>> > org.apache.calcite.rel.metadata.CyclicMetadataException
>> > > >> > > > >>> >>>>>> at
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>
>> > > >> > > > >>>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10.
>> The
>> > > test
>> > > >> was
>> > > >> > > > only
>> > > >> > > > >>> >> added
>> > > >> > > > >>> >>>> on
>> > > >> > > > >>> >>>>>> May 22 so I assume that it will continue to fail
>> > > >> > > intermittently
>> > > >> > > > >>> >> until
>> > > >> > > > >>> >>>> we do
>> > > >> > > > >>> >>>>>> something.
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>> Anyone have any ideas?
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>> Laurent, As you added the test can you please look
>> > into
>> > > >> it?
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>> Julian
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>>>
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>>>
>> > > >> > > > >>> >>
>> > > >> > > > >>> >
>> > > >> > > > >>>
>> > > >> > > > >>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>> >
>>
>>

Reply via email to