It looks like Jenkins build with JDK8 now takes more than 40min and is
being aborted, with the core module taking something like 38min and being
stopped during tests. It seems to be independent of my test change since
RelOptUtilTest shows to be running successfully in 1second.
That said, I don't see a test still be running from the logs, so I guess it
would be something keeping the jvm from stopping even if all tests are
completed?

On Wed, May 29, 2019 at 4:15 PM Stamatis Zampetakis <zabe...@gmail.com>
wrote:

> Makes sense, I was expecting something like that.
>
> +1 for the PR. Even if we are not 100% that will resolve the problem,
> getting rid of the statics seems like a good idea and beneficial in any
> case.
>
> On Thu, May 30, 2019 at 12:02 AM Laurent Goujon <laur...@dremio.com>
> wrote:
>
> > Unfortunately, not locally, so I ended in crude debugging and code
> analysis
> > to understand where the static reloptcluster created for the static
> > relbuilder was used in a context which would cause issue in a
> multithreaded
> > context. Once I saw several rel nodes getting the parent cluster (the
> > static table scan) to access the metadata query (which cannot be reused
> > between threads), that's where I assumed this would be the problem.
> >
> > On Wed, May 29, 2019 at 2:58 PM Stamatis Zampetakis <zabe...@gmail.com>
> > wrote:
> >
> > > Thanks a lot Laurent, much appreciated!
> > >
> > > BTW, were you able to reproduce the problem?
> > >
> > > On Wed, May 29, 2019 at 11:29 PM Laurent Goujon <laur...@dremio.com>
> > > wrote:
> > >
> > > > Here's the pull request: https://github.com/apache/calcite/pull/1240
> > > >
> > > > On Wed, May 29, 2019 at 1:58 PM Laurent Goujon <laur...@dremio.com>
> > > wrote:
> > > >
> > > > > Looks like most {{RelNode#create()}} access the
> > > RelOptCluster/RelBuilder
> > > > > instance from their child, and some then, perform metadata
> operation,
> > > > which
> > > > > would cause CyclicMetadataException for example. I'll create a
> fixup
> > > > patch.
> > > > >
> > > > > On Wed, May 29, 2019 at 9:00 AM Laurent Goujon <laur...@dremio.com
> >
> > > > wrote:
> > > > >
> > > > >> I actually did the change from using a static relbuilder to a mix
> of
> > > > >> static (to create the scans) and per test because of concurrency
> > > issues
> > > > :(
> > > > >> Maybe the test should only been using a per test-case relbuilder
> > > > instead,
> > > > >> but it would mean that there's some code relying on the static
> > > > relbuilder
> > > > >> (by accessing it from the scan rel nodes) in a non-static context.
> > > I'll
> > > > try
> > > > >> to debug it too (but so far was unlucky reproducing the jenkins
> > > > problem).
> > > > >>
> > > > >> On Wed, May 29, 2019 at 5:02 AM Francis Chuang <
> > > > francischu...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> In Go, there's a built in race detector that can be used when
> > running
> > > > >>> tests using `go test`.
> > > > >>>
> > > > >>> There's RacerD [1] from Facebook that can detect races in Java.
> > > Perhaps
> > > > >>> this is something that can be looked in to to find the race.
> > > > >>>
> > > > >>> [1] https://fbinfer.com/docs/racerd.html
> > > > >>>
> > > > >>> On 29/05/2019 9:52 pm, Stamatis Zampetakis wrote:
> > > > >>> > Good insights, Ruben, Danny!
> > > > >>> >
> > > > >>> > Assuming that is indeed a concurrency problem, it will be
> > difficult
> > > > to
> > > > >>> > identify since many parts in Calcite are not thread safe. Not
> > being
> > > > >>> able to
> > > > >>> > reproduce the problem makes the things even worse.
> > > > >>> >
> > > > >>> > Given that intermittent test failures occur often on Jenkins,
> how
> > > > about
> > > > >>> > creating a new branch dedicated to debugging?
> > > > >>> > It can be associated with a new Jenkins jobs (that obviously
> > > doesn't
> > > > >>> send
> > > > >>> > mails to everybody when there are failures) that can be
> launched
> > > > >>> on-demand
> > > > >>> > by the person who is looking into the problem.
> > > > >>> >
> > > > >>> > On the other hand, we do not really make an effort to have
> tests
> > > that
> > > > >>> can
> > > > >>> > be executed concurrently so another alternative would be to run
> > > tests
> > > > >>> only
> > > > >>> > sequentially.
> > > > >>> > I am not very fan of this approach since it will rather hide
> > > problems
> > > > >>> than
> > > > >>> > solve them.
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> > On Wed, May 29, 2019 at 1:02 PM Yuzhao Chen <
> > yuzhao....@gmail.com>
> > > > >>> wrote:
> > > > >>> >
> > > > >>> >> Thanks Ruben for your good analysis.
> > > > >>> >>
> > > > >>> >> What I’m confused is that isn’t the static REL_BUILDER more
> > prone
> > > to
> > > > >>> have
> > > > >>> >> concurrency problems ? And the pushed scans(EMP_SCAN and
> > > DEPT_SCAN)
> > > > >>> are all
> > > > >>> >> nodes(immutable), how could this be a problem ?
> > > > >>> >>
> > > > >>> >> Best,
> > > > >>> >> Danny Chan
> > > > >>> >> 在 2019年5月29日 +0800 PM5:37,Ruben Q L <rube...@gmail.com>,写道:
> > > > >>> >>> I'm checking the commit [1] and I see something strange in
> > > > >>> >> RelOptUtilTest.
> > > > >>> >>> Maybe I'm wrong and it is nothing, but just in case it may
> > help:
> > > > >>> >>>
> > > > >>> >>> With the latest modification, it seems that we have two
> > > > >>> RelBuilder(s) in
> > > > >>> >>> place:
> > > > >>> >>> - A static one that is created ad-hoc on a static block to
> > > generate
> > > > >>> the
> > > > >>> >>> EMP_SCAN and DEPT_SCAN RelNodes [2]
> > > > >>> >>> - An instance one to be used in the tests, that is
> initialized
> > on
> > > > >>> >>> the @Before public void setUp() method [3]
> > > > >>> >>>
> > > > >>> >>> Before this commit, the EMP_SCAN / DEPT_SCAN were only used
> to
> > > read
> > > > >>> their
> > > > >>> >>> rowTypes to test some join auxiliary methods. But the new
> > > > >>> >>> tests testPushDownJoinConditions* actually build a plan and
> > push
> > > > >>> these
> > > > >>> >>> scans into the RelBuilder to be tested [4] (which is a
> > different
> > > > one
> > > > >>> than
> > > > >>> >>> the static RelBuider that created the scans).
> > > > >>> >>> Maybe this is no problem generally, but it can potentially be
> > > under
> > > > >>> >> certain
> > > > >>> >>> circumstances?, which would explain the randomness of the
> > issue.
> > > > >>> >>> Could this explain the exception?
> > > > >>> >>>
> > > > >>> >>> [1]
> > > > >>> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> > > > >>> >>>
> > > > >>> >>> [2]
> > > > >>> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L71
> > > > >>> >>> [3]
> > > > >>> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L92
> > > > >>> >>> [4]
> > > > >>> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://github.com/apache/calcite/blob/ac40d6951bc8c475ca6804be6d878107cc2ebb13/core/src/test/java/org/apache/calcite/plan/RelOptUtilTest.java#L292
> > > > >>> >>>
> > > > >>> >>>
> > > > >>> >>>
> > > > >>> >>> Le mer. 29 mai 2019 à 02:20, Julian Hyde <jh...@apache.org>
> a
> > > > écrit
> > > > >>> :
> > > > >>> >>>
> > > > >>> >>>> It’s a tough call. It is probable that the problem existed
> > > already
> > > > >>> and
> > > > >>> >> the
> > > > >>> >>>> change merely surfaced it.
> > > > >>> >>>>
> > > > >>> >>>>> On May 28, 2019, at 5:17 PM, Stamatis Zampetakis <
> > > > >>> zabe...@gmail.com>
> > > > >>> >>>> wrote:
> > > > >>> >>>>>
> > > > >>> >>>>> It is not the only test that is failing after commit [1]
> but
> > > all
> > > > >>> the
> > > > >>> >> new
> > > > >>> >>>>> tests that were added.
> > > > >>> >>>>>
> > > > >>> >>>>> I've seen the problem on Jenkins on all JDKS but I cannot
> > > > reproduce
> > > > >>> >> it
> > > > >>> >>>>> locally.
> > > > >>> >>>>> I guess we have to do with a race condition most likely due
> > to
> > > > the
> > > > >>> >>>>> concurrent execution of tests with surefire.
> > > > >>> >>>>>
> > > > >>> >>>>> Should we revert the commit till we find a solution?
> > > > >>> >>>>>
> > > > >>> >>>>> [1]
> > > > >>> >>>>>
> > > > >>> >>>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://github.com/apache/calcite/commit/82e7d4e760cb203d31956c55e38e0fdd56119d58
> > > > >>> >>>>>
> > > > >>> >>>>> On Tue, May 28, 2019 at 7:57 PM Julian Hyde <
> > jh...@apache.org>
> > > > >>> >> wrote:
> > > > >>> >>>>>
> > > > >>> >>>>>> I have seen this intermittent failure 3 times in the last
> > > week:
> > > > >>> >>>>>>
> > > > >>> >>>>>> [INFO] Running org.apache.calcite.plan.RelOptUtilTest
> > > > >>> >>>>>> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0,
> > > Time
> > > > >>> >> elapsed:
> > > > >>> >>>>>> 0.411 s <<< FAILURE! - in
> > > org.apache.calcite.plan.RelOptUtilTest
> > > > >>> >>>>>> [ERROR]
> > > > >>> >>>>>>
> > > > >>> >>>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(org.apache.calcite.plan.RelOptUtilTest)
> > > > >>> >>>>>> Time elapsed: 0.349 s <<< ERROR!
> > > > >>> >>>>>> org.apache.calcite.rel.metadata.CyclicMetadataException
> > > > >>> >>>>>> at
> > > > >>> >>>>>>
> > > > >>> >>>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> org.apache.calcite.plan.RelOptUtilTest.testPushDownJoinConditionsWithExpandedIsNotDistinctUsingCase(RelOptUtilTest.java:445)
> > > > >>> >>>>>>
> > > > >>> >>>>>> I have seen it on Oracle JDK 12 and OpenJDK 10. The test
> was
> > > > only
> > > > >>> >> added
> > > > >>> >>>> on
> > > > >>> >>>>>> May 22 so I assume that it will continue to fail
> > > intermittently
> > > > >>> >> until
> > > > >>> >>>> we do
> > > > >>> >>>>>> something.
> > > > >>> >>>>>>
> > > > >>> >>>>>> Anyone have any ideas?
> > > > >>> >>>>>>
> > > > >>> >>>>>> Laurent, As you added the test can you please look into
> it?
> > > > >>> >>>>>>
> > > > >>> >>>>>> Julian
> > > > >>> >>>>>>
> > > > >>> >>>>>>
> > > > >>> >>>>
> > > > >>> >>>>
> > > > >>> >>
> > > > >>> >
> > > > >>>
> > > > >>
> > > >
> > >
> >
>

Reply via email to