Yes, you can click on the test job where a test failed. From there you can click on the published artifacts (the "1 artifact produced" link on the job page, or the "X published" artifacts link on the overview page). This brings you to the published artifacts page, where we upload the logs for every job.
Cheers,
Till

On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <lindon...@gmail.com> wrote:

Thanks Till. Yes, you are right. The INFO logging is enabled; it is just dumped to a file (the FileAppender) rather than to the console.

There is probably a way to retrieve that log file from AZP. I will ask other colleagues how to get this later.

On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think for the maven tests we use this log4j.properties file [1].

[1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties

Cheers,
Till

On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <lindon...@gmail.com> wrote:

Thanks for the detailed explanations! Regarding the usage of timeouts, I now agree that it is better to remove per-test timeouts because it helps make our testing results more reliable and consistent.

My previous concern was that it might not be a good idea to intentionally let a test hang in AZP in order to get the thread dump. Now I understand that there are a few practical concerns around the usage of timeouts which make testing results unreliable (e.g. flakiness in the presence of VM migration).

Regarding the log level on AZP, it appears that we actually set "rootLogger.level = OFF" in most log4j2-test.properties files, which means that no INFO logs would be printed on AZP. For example, I tried to increase the log level in this PR <https://github.com/apache/flink/pull/15617> and was advised in this comment <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055> to avoid increasing the log level. Did I miss something here?

On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:

Just to add to Dong Lin's list of cons of allowing timeouts:
- Any timeout value that you set manually is arbitrary. If it's set too low, you get test instabilities. What "too low" means depends on numerous factors, such as hardware and current utilization (especially I/O). If you run in VMs and the VM is migrated while running a test, any reasonable timeout will probably fail. While you could make a similar case for the overall timeout of tests, a smaller hiccup in the range of minutes will not impact the overall runtime much, and the probability of a VM constantly migrating during the same stage is abysmally low.
- A timeout is more maintenance-intensive. It's one more knob with which you can tweak a build. If you change the test a bit, you also need to double-check the timeout. Hence, there have been quite a few commits that just increase timeouts.
- Whether a test uses a timeout or not is arbitrary: why do some ITs have a timeout and others don't? All IT cases are prone to time out if there are issues with resource allocation. Similarly, quite a few unit tests have timeouts while others don't, with no obvious pattern.
- An ill-set timeout reduces build reproducibility. Imagine having a release with such a timeout and users cannot build Flink reliably.
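For concreteness, the per-test timeouts being discussed are the JUnit 4 constructs sketched below. The class and method names are made up for illustration; this is a minimal sketch, not code from the Flink repository:

```
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

public class ExampleTimeoutTest {

    // Class-level knob: every test method in this class fails if it runs longer than 2 minutes.
    @Rule
    public final Timeout perTestTimeout = Timeout.seconds(120);

    // Method-level knob: this single test fails after 60 seconds.
    @Test(timeout = 60_000)
    public void testSomething() throws Exception {
        // test body
    }
}
```

Both variants fail the hanging test from within the JVM and let the build move on, so no thread dump of the stuck state is captured.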
I'd also like to point out that we should not cater to unstable tests if our overall goal is to have as many green builds as possible. If we assume that our builds fail more often than not, we should also look in the other direction and continue the builds on error. I'm not a big fan of that.

One argument that I have also heard is that timeouts ease local debugging in case of refactorings, as you can see multiple failures at the same time. But no one is keeping you from temporarily adding a timeout on your branch. Then we can be sure that the timeout is plausible for your hardware, and we avoid all the drawbacks mentioned above.

@Robert Metzger <rmetz...@apache.org>

> If we had a global limit of 1 minute per test, we would have caught this case (and we would encourage people to be careful with CI time).

There are quite a few tests that run longer, especially on a well-utilized build machine. A global limit is even worse than individual limits, as there is no value that fits all cases. If you screwed up and 200 tests hang, you'd run into the global timeout anyway. I'm also not sure what these additional hangs bring you except a huge log.

I'm also not sure it's really better in terms of CI time. For example, in UnalignedCheckpointRescaleITCase we test all known partitioners in one pipeline for correctness. For higher parallelism, that means the test regularly runs over 1 minute. If there were a global limit, I'd need to split the test into smaller chunks, and I'm positive that the sum of the chunks would take longer than before.

PS: all tests on AZP will print INFO in the artifacts. There you can also retrieve the stack traces.
PPS: I also said that we should revalidate the current timeout on AZP. So the argument that we waste >2h of precious CI time is kind of constructed and is just due to some random defaults.

On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think we do capture the INFO logs of the test runs on AZP.

I am also not sure whether we really caught slow tests with JUnit's timeout rule before. I think the default is usually to increase the timeout to make the test pass. One way to find slow tests is to measure the time and look at the outliers.

Cheers,
Till

On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote:

There is one more point that may be useful to consider here.

In order to debug a deadlock that is not easily reproducible, it is likely not sufficient to see only the thread dump to figure out the root cause. We likely also need to enable INFO level logging. Since AZP does not provide INFO level logging by default, we either need to reproduce the bug locally or change the AZP log4j configuration temporarily.
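As an illustration of such a temporary change: switching a log4j2-test.properties from OFF to INFO with a file appender would look roughly like the sketch below. The appender name and file path are made up, and this is not the actual contents of tools/ci/log4j.properties:

```
# Sketch of a log4j2-test.properties that enables INFO logging to a file.
rootLogger.level = INFO
rootLogger.appenderRef.file.ref = TestFile

appender.file.name = TestFile
appender.file.type = File
appender.file.fileName = target/test.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{HH:mm:ss,SSS} [%t] %-5p %c - %m%n
```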
This further reduces the benefit of logging the thread dump (which comes at the cost of letting the AZP job hang).

On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote:

Just to make sure I understand the proposal correctly: is the proposal to disallow the usage of @Test(timeout=...) for Flink JUnit tests?

Here is my understanding of the pros/cons according to the discussion so far.

Pros of allowing timeouts:
1) When there are tests that are unreasonably slow, it helps us catch those tests and thus increases the quality of our unit tests.
2) When there are tests that cause a deadlock, it helps the AZP job fail fast instead of being blocked for 4 hours. This saves resources and also allows developers to get their PR tested again earlier (useful when the test failure is not relevant to their PR).

Cons of allowing timeouts:
1) When there are tests that cause a deadlock, we cannot see the thread dump of all threads, which makes debugging the issue harder.

I would suggest that we still allow timeouts because the pros outweigh the cons.

As far as I can tell, if we allow timeouts and encounter a deadlock bug in AZP, we still know which test (or test suite) fails. There is a good chance we can reproduce the deadlock locally (by running it 100 times) and get the debug information we need. In the rare case where the deadlock happens only on AZP, we can just disable the timeout for that particular test. So the lack of a thread dump is not really a concern.

On the other hand, if we disallow timeouts, it will be very hard for us to catch low-quality tests. I don't know of a good alternative way to catch those tests.

On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi devs!

I wanted to bring up something that was discussed in a few independent groups of people in the past days. I'd like to revise using timeouts in our JUnit tests. The suggestion would be not to use them anymore. The problem with timeouts is that we get no thread dump and stack traces of the system as it hangs. If we were not using a timeout, the CI runner would have caught the hanging test and created a thread dump, which often is a great starting point for debugging.

This problem has been spotted e.g. during debugging FLINK-22416 [1].
In the past, thread dumps were not always taken for hanging tests, but this was changed quite recently in FLINK-21346 [2]. I am happy to hear your opinions on it. If there are no objections, I would like to add the suggestion to the Coding Guidelines [3].

Best,

Dawid

[1] https://issues.apache.org/jira/browse/FLINK-22416

[2] https://issues.apache.org/jira/browse/FLINK-21346

[3] https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
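For reference, the thread dump produced when a hanging build is caught at the CI level is conceptually just a stack trace of every live thread, roughly as in the sketch below. This is only an illustration of the idea, not the actual FLINK-21346 tooling, and the class name is made up:

```
import java.util.Map;

// Illustration only: print a stack trace for every live thread, similar to what a
// CI-level watchdog (or `jstack <pid>`) produces when a build hangs.
public final class ThreadDumpSketch {

    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            System.err.printf("%n\"%s\" state=%s%n", thread.getName(), thread.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("\tat " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
```

A per-test @Test(timeout=...) fails the test and lets the JVM continue, so this view of the stuck state is never captured.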