Unfortunately, not locally, so I ended in crude debugging and code analysis
to understand where the static reloptcluster created for the static
relbuilder was used in a context which would cause issue in a multithreaded
context. Once I saw several rel nodes getting the parent cluster (the
static table scan) to access the metadata query (which cannot be reused
between threads), that's where I assumed this would be the problem.

On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis <[email protected]>
wrote:

> Thanks a lot Laurent, much appreciated!
>
> BTW, were you able to reproduce the problem?
>
> On Wed, May 29, 2019 at 11:29 PM Laurent Goujon <[email protected]>
> wrote:
>
> > Here's the pull request: https://github.com/apache/calcite/pull/1240
> >
> > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon <[email protected]>
> wrote:
> >
> > > Looks like most {{RelNode#create()}} access the
> RelOptCluster/RelBuilder
> > > instance from their child, and some then, perform metadata operation,
> > which
> > > would cause CyclicMetadataException for example. I'll create a fixup
> > patch.
> > >
> > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <[email protected]>
> > wrote:
> > >
> > >> I actually did the change from using a static relbuilder to a mix of
> > >> static (to create the scans) and per test because of concurrency
> issues
> > :(
> > >> Maybe the test should only been using a per test-case relbuilder
> > instead,
> > >> but it would mean that there's some code relying on the static
> > relbuilder
> > >> (by accessing it from the scan rel nodes) in a non-static context.
> I'll
> > try
> > >> to debug it too (but so far was unlucky reproducing the jenkins
> > problem).
> > >>
> > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang <
> > [email protected]>
> > >> wrote:
> > >>
> > >>> In Go, there's a built in race detector that can be used when running
> > >>> tests using `go test`.
> > >>>
> > >>> There's RacerD [1] from Facebook that can detect races in Java.
> Perhaps
> > >>> this is something that can be looked in to to find the race.
> > >>>
> > >>> [1] https://fbinfer.com/docs/racerd.html
> > >>>
> > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
> > >>> > Good insights, Ruben, Danny!
> > >>> >
> > >>> > Assuming that is indeed a concurrency problem, it will be difficult
> > to
> > >>> > identify since many parts in Calcite are not thread safe. Not being
> > >>> able to
> > >>> > reproduce the problem makes the things even worse.
> > >>> >
> > >>> > Given that intermittent test failures occur often on Jenkins, how
> > about
> > >>> > creating a new branch dedicated to debugging?
> > >>> > It can be associated with a new Jenkins jobs (that obviously
> doesn't
> > >>> send
> > >>> > mails to everybody when there are failures) that can be launched
> > >>> on-demand
> > >>> > by the person who is looking into the problem.
> > >>> >
> > >>> > On the other hand, we do not really make an effort to have tests
> that
> > >>> can
> > >>> > be executed concurrently so another alternative would be to run
> tests
> > >>> only
> > >>> > sequentially.
> > >>> > I am not very fan of this approach since it will rather hide
> problems
> > >>> than
> > >>> > solve them.
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <[email protected]>
> > >>> wrote:
> > >>> >
> > >>> >> Thanks Ruben for your good analysis.
> > >>> >>
> > >>> >> What I’m confused is that isn’t the static REL_BUILDER more prone
> to
> > >>> have
> > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN and
> DEPT_SCAN)
> > >>> are all
> > >>> >> nodes(immutable), how could this be a problem ?
> > >>> >>
> > >>> >> Best,
> > >>> >> Danny Chan
> > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <[email protected]>,写道:
> > >>> >>> I'm checking the commit [1] and I see something strange in
> > >>> >> RelOptUtilTest.
> > >>> >>> Maybe I'm wrong and it is nothing, but just in case it may help:
> > >>> >>>
> > >>> >>> With the latest modification, it seems that we have two
> > >>> RelBuilder(s) in
> > >>> >>> place:
> > >>> >>> - A static one that is created ad-hoc on a static block to
> generate
> > >>> the
> > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
> > >>> >>> - An instance one to be used in the tests, that is initialized on
> > >>> >>> the @Before public void setUp() method [3]
> > >>> >>>
> > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to
> read
> > >>> their
> > >>> >>> rowTypes to test some join auxiliary methods. But the new
> > >>> >>> tests testPushDownJoinConditions* actually build a plan and push
> > >>> these
> > >>> >>> scans into the RelBuilder to be tested [4] (which is a different
> > one
> > >>> than
> > >>> >>> the static RelBuider that created the scans).
> > >>> >>> Maybe this is no problem generally, but it can potentially be
> under
> > >>> >> certain
> > >>> >>> circumstances?, which would explain the randomness of the issue.
> > >>> >>> Could this explain the exception?
> > >>> >>>
> > >>> >>> [1]
> > >>> >>>
> > >>> >>
> > >>>
> >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> > >>> >>>
> > >>> >>> [2]
> > >>> >>>
> > >>> >>
> > >>>
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
> > >>> >>> [3]
> > >>> >>>
> > >>> >>
> > >>>
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
> > >>> >>> [4]
> > >>> >>>
> > >>> >>
> > >>>
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <[email protected]> a
> > écrit
> > >>> :
> > >>> >>>
> > >>> >>>> It’s a tough call. It is probable that the problem existed
> already
> > >>> and
> > >>> >> the
> > >>> >>>> change merely surfaced it.
> > >>> >>>>
> > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <
> > >>> [email protected]>
> > >>> >>>> wrote:
> > >>> >>>>>
> > >>> >>>>> It is not the only test that is failing after commit [1] but
> all
> > >>> the
> > >>> >> new
> > >>> >>>>> tests that were added.
> > >>> >>>>>
> > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot
> > reproduce
> > >>> >> it
> > >>> >>>>> locally.
> > >>> >>>>> I guess we have to do with a race condition most likely due to
> > the
> > >>> >>>>> concurrent execution of tests with surefire.
> > >>> >>>>>
> > >>> >>>>> Should we revert the commit till we find a solution?
> > >>> >>>>>
> > >>> >>>>> [1]
> > >>> >>>>>
> > >>> >>>>
> > >>> >>
> > >>>
> >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> > >>> >>>>>
> > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <[email protected]>
> > >>> >> wrote:
> > >>> >>>>>
> > >>> >>>>>> I have seen this intermittent failure 3 times in the last
> week:
> > >>> >>>>>>
> > >>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest
> > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0,
> Time
> > >>> >> elapsed:
> > >>> >>>>>> 0.411 s <<< FAILURE! - in
> org.apache.calcite.plan.RelOptUtilTest
> > >>> >>>>>> [ERROR]
> > >>> >>>>>>
> > >>> >>>>
> > >>> >>
> > >>>
> >
> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
> > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR!
> > >>> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException
> > >>> >>>>>> at
> > >>> >>>>>>
> > >>> >>>>
> > >>> >>
> > >>>
> >
> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
> > >>> >>>>>>
> > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was
> > only
> > >>> >> added
> > >>> >>>> on
> > >>> >>>>>> May 22 so I assume that it will continue to fail
> intermittently
> > >>> >> until
> > >>> >>>> we do
> > >>> >>>>>> something.
> > >>> >>>>>>
> > >>> >>>>>> Anyone have any ideas?
> > >>> >>>>>>
> > >>> >>>>>> Laurent, As you added the test can you please look into it?
> > >>> >>>>>>
> > >>> >>>>>> Julian
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>
> > >>> >>>>
> > >>> >>
> > >>> >
> > >>>
> > >>
> >
>

Reply via email to