Thanks for the detailed explanations! Regarding the usage of timeout, now I
agree that it is better to remove per-test timeouts because it helps
make our testing results more reliable and consistent.

My previous concern is that it might not be a good idea to intentionally
let the test hang in AZP in order to get the thread dump. Now I get that
there are a few practical concerns around the usage of timeout which makes
testing results unreliable (e.g. flakiness in the presence of VM migration).

Regarding the level logging on AZP, it appears that we actually set
"rootLogger.level = OFF" in most log4j2-test.properties, which means that
no INFO log would be printed on AZP. For example, I tried to increase the
log level in this <https://github.com/apache/flink/pull/15617> PR and was
suggested in this
<https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055>
comment to avoid increasing the log level. Did I miss something here?


On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:

> Just to add to Dong Lin's list of cons of allowing timeout:
> - Any timeout value that you manually set is arbitrary. If it's set too
> low, you get test instabilities. What too low means depends on numerous
> factors, such as hardware and current utilization (especially I/O). If you
> run in VMs and the VM is migrated while running a test, any reasonable
> timeout will probably fail. While you could make a similar case for the
> overall timeout of tests, any smaller hiccup in the range of minutes will
> not impact the overall runtime much. The probability of having a VM
> constantly migrating during the same stage is abysmally low.
> - A timeout is more maintenance-intensive. It's one more knob where you can
> tweak a build or not. If you change the test a bit, you also need to
> double-check the timeout. Hence, there have been quite a few commits that
> just increase timeouts.
> - Whether a test uses a timeout or not is arbitrary: Why do some ITs have a
> timeout and others don't? All IT tests are prone to timeout if there are
> issues with resource allocation. Similarly, there are quite a few unit
> tests with timeouts while others don't have them with no obvious pattern.
> - An ill-set timeout reduces build reproducibility. Imagine having a
> release with such a timeout and the users cannot build Flink reliably.
>
> I'd like to also point out that we should not cater around unstable tests
> if our overall goal is to have as many green builds as possible. If we
> assume that our builds fail more often than not, we should also look into
> the other direction and continue the builds on error. I'm not a big fan of
> that.
>
> One argument that I also heard is that it eases local debugging in case of
> refactorings as you can see multiple failures at the same time. But no one
> is keeping you from temporarily adding a timeout on your branch. Then, we
> can be sure that the timeout is plausible for your hardware and avoid all
> above mentioned drawbacks.
>
> @Robert Metzger <rmetz...@apache.org>
>
> > If we had a global limit of 1 minute per test, we would have caught this
> > case (and we would encourage people to be careful with CI time).
> >
> There are quite a few tests that run longer, especially on a well utilized
> build machine. A global limit is even worse than individual limits as there
> is no value that fits it all. If you screwed up and 200 tests hang, you'd
> also run into the global timeout anyway. I'm also not sure what these
> additional hangs bring you except a huge log.
>
> I'm also not sure if it's really better in terms of CI time. For example,
> for UnalignedCheckpointRescaleITCase, we test all known partitioners in one
> pipeline for correctness. For higher parallelism, that means the test runs
> over 1 minute regularly. If there is a global limit, I'd need to split the
> test into smaller chunks, where I'm positive that the sum of the chunks
> will be larger than before.
>
> PS: all tests on AZP will print INFO in the artifacts. There you can also
> retrieve the stacktraces.
> PPS: I also said that we should revalidate the current timeout on AZP. So
> the argument that we have >2h of precious CI time wasted is kind of
> constructed and is just due to some random defaults.
>
> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > I think we do capture the INFO logs of the test runs on AZP.
> >
> > I am also not sure whether we really caught slow tests with Junit's
> timeout
> > rule before. I think the default is usually to increase the timeout to
> make
> > the test pass. One way to find slow tests is to measure the time and look
> > at the outliers.
> >
> > Cheers,
> > Till
> >
> > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote:
> >
> > > There is one more point that may be useful to consider here.
> > >
> > > In order to debug deadlock that is not easily reproducible, it is
> likely
> > > not sufficient to see only the thread dump to figure out the root
> cause.
> > We
> > > likely need to enable the INFO level logging. Since AZP does not
> provide
> > > INFO level logging by default, we either need to reproduce the bug
> > locally
> > > or change the AZP log4j temporarily. This further reduces the benefit
> of
> > > logging the thread dump (which comes at the cost of letting the AZP job
> > > hang).
> > >
> > >
> > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Just to make sure I understand the proposal correctly: is the
> proposal
> > to
> > > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > > >
> > > > Here is my understanding of the pros/cons according to the discussion
> > so
> > > > far.
> > > >
> > > > Pros of allowing timeout:
> > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > catch those tests and thus increase the quality of unit tests.
> > > > 2) When there are tests that cause deadlock, it helps the AZP job
> fail
> > > > fast instead of being blocked for 4 hours. This saves resources and
> > also
> > > > allows developers to get their PR tested again earlier (useful when
> the
> > > > test failure is not relevant to their PR).
> > > >
> > > > Cons of allowing timeout:
> > > > 1) When there are tests that cause deadlock, we could not see the
> > thread
> > > > dump of all threads, which makes debugging the issue harder.
> > > >
> > > > I would suggest that we should still allow timeout because the pros
> > > > outweigh the cons.
> > > >
> > > > As far as I can tell, if we allow timeout and encounter a deadlock
> bug
> > in
> > > > AZP, we still know which test (or test suite) fails. There is a good
> > > chance
> > > > we can reproduce the deadlock locally (by running it 100 times) and
> get
> > > the
> > > > debug information we need. In the rare case where the deadlock
> happens
> > > only
> > > > on AZP, we can just disable the timeout for that particular test. So
> > the
> > > > lack of thread dump is not really a concern.
> > > >
> > > > On the other hand, if we disallow timeout, it will be very hard for
> us
> > to
> > > > catch low-quality tests. I don't know if there is a good alternative
> > way
> > > to
> > > > catch those tests.
> > > >
> > > >
> > > >
> > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > dwysakow...@apache.org
> > > >
> > > > wrote:
> > > >
> > > >> Hi devs!
> > > >>
> > > >> I wanted to bring up something that was discussed in a few
> independent
> > > >> groups of people in the past days. I'd like to revise using timeouts
> > in
> > > >> our JUnit tests. The suggestion would be not to use them anymore.
> The
> > > >> problem with timeouts is that we have no thread dump and stack
> traces
> > of
> > > >> the system as it hangs. If we were not using a timeout, the CI
> runner
> > > >> would have caught the timeout and created a thread dump which often
> > is a
> > > >> great starting point for debugging.
> > > >>
> > > >> This problem has been spotted e.g. during debugging FLINK-22416[1].
> In
> > > >> the past thread dumps were not always taken for hanging tests, but
> it
> > > >> was changed quite recently in FLINK-21346[2]. I am happy to hear
> your
> > > >> opinions on it. If there are no objections I would like to add the
> > > >> suggestion to the Coding Guidelines[3]
> > > >>
> > > >> Best,
> > > >>
> > > >> Dawid
> > > >>
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > >>
> > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > >>
> > > >> [3]
> > > >>
> > > >>
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > >>
> > > >>
> > > >>
> > >
> >
>

Reply via email to