I actually did the change from using a static relbuilder to a mix of static
(to create the scans) and per test because of concurrency issues :( Maybe
the test should only been using a per test-case relbuilder instead, but it
would mean that there's some code relying on the static relbuilder (by
accessing it from the scan rel nodes) in a non-static context. I'll try to
debug it too (but so far was unlucky reproducing the jenkins problem).

On Wed, May 29, 2019 at 5:02 AM Francis Chuang <[email protected]>
wrote:

> In Go, there's a built in race detector that can be used when running
> tests using `go test`.
>
> There's RacerD [1] from Facebook that can detect races in Java. Perhaps
> this is something that can be looked in to to find the race.
>
> [1] https://fbinfer.com/docs/racerd.html
>
> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
> > Good insights, Ruben, Danny!
> >
> > Assuming that is indeed a concurrency problem, it will be difficult to
> > identify since many parts in Calcite are not thread safe. Not being able
> to
> > reproduce the problem makes the things even worse.
> >
> > Given that intermittent test failures occur often on Jenkins, how about
> > creating a new branch dedicated to debugging?
> > It can be associated with a new Jenkins jobs (that obviously doesn't send
> > mails to everybody when there are failures) that can be launched
> on-demand
> > by the person who is looking into the problem.
> >
> > On the other hand, we do not really make an effort to have tests that can
> > be executed concurrently so another alternative would be to run tests
> only
> > sequentially.
> > I am not very fan of this approach since it will rather hide problems
> than
> > solve them.
> >
> >
> >
> >
> >
> >
> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <[email protected]>
> wrote:
> >
> >> Thanks Ruben for your good analysis.
> >>
> >> What I’m confused is that isn’t the static REL_BUILDER more prone to
> have
> >> concurrency problems ? And the pushed scans(EMP_SCAN and DEPT_SCAN) are
> all
> >> nodes(immutable), how could this be a problem ?
> >>
> >> Best,
> >> Danny Chan
> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <[email protected]>,写道:
> >>> I'm checking the commit [1] and I see something strange in
> >> RelOptUtilTest.
> >>> Maybe I'm wrong and it is nothing, but just in case it may help:
> >>>
> >>> With the latest modification, it seems that we have two RelBuilder(s)
> in
> >>> place:
> >>> - A static one that is created ad-hoc on a static block to generate the
> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
> >>> - An instance one to be used in the tests, that is initialized on
> >>> the @Before public void setUp() method [3]
> >>>
> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used to read
> their
> >>> rowTypes to test some join auxiliary methods. But the new
> >>> tests testPushDownJoinConditions* actually build a plan and push these
> >>> scans into the RelBuilder to be tested [4] (which is a different one
> than
> >>> the static RelBuider that created the scans).
> >>> Maybe this is no problem generally, but it can potentially be under
> >> certain
> >>> circumstances?, which would explain the randomness of the issue.
> >>> Could this explain the exception?
> >>>
> >>> [1]
> >>>
> >>
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> >>>
> >>> [2]
> >>>
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
> >>> [3]
> >>>
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
> >>> [4]
> >>>
> >>
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
> >>>
> >>>
> >>>
> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <[email protected]> a écrit :
> >>>
> >>>> It’s a tough call. It is probable that the problem existed already and
> >> the
> >>>> change merely surfaced it.
> >>>>
> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> It is not the only test that is failing after commit [1] but all the
> >> new
> >>>>> tests that were added.
> >>>>>
> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot reproduce
> >> it
> >>>>> locally.
> >>>>> I guess we have to do with a race condition most likely due to the
> >>>>> concurrent execution of tests with surefire.
> >>>>>
> >>>>> Should we revert the commit till we find a solution?
> >>>>>
> >>>>> [1]
> >>>>>
> >>>>
> >>
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> >>>>>
> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <[email protected]>
> >> wrote:
> >>>>>
> >>>>>> I have seen this intermittent failure 3 times in the last week:
> >>>>>>
> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest
> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time
> >> elapsed:
> >>>>>> 0.411 s <<< FAILURE! - in org.apache.calcite.plan.RelOptUtilTest
> >>>>>> [ERROR]
> >>>>>>
> >>>>
> >>
> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
> >>>>>> Time elapsed: 0.349 s <<< ERROR!
> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException
> >>>>>> at
> >>>>>>
> >>>>
> >>
> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
> >>>>>>
> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test was only
> >> added
> >>>> on
> >>>>>> May 22 so I assume that it will continue to fail intermittently
> >> until
> >>>> we do
> >>>>>> something.
> >>>>>>
> >>>>>> Anyone have any ideas?
> >>>>>>
> >>>>>> Laurent, As you added the test can you please look into it?
> >>>>>>
> >>>>>> Julian
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >
>

Reply via email to