I think for the Maven tests we use this log4j.properties file [1].

[1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
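The difference this file makes is easiest to see next to the module-level test configurations: the CI file keeps INFO output so it lands in the build artifacts, while most log4j2-test.properties files switch logging off entirely. A sketch of that contrast (illustrative values only, not the verbatim Flink files):

```properties
# Illustrative sketch, not the verbatim Flink files.

# tools/ci/log4j.properties: CI keeps INFO-level output, so logs and
# stacktraces end up in the build artifacts.
rootLogger.level = INFO

# Typical module-level log4j2-test.properties: logging is switched off
# for quiet local test runs.
# rootLogger.level = OFF
```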
Cheers,
Till

On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <lindon...@gmail.com> wrote:

> Thanks for the detailed explanations! Regarding the usage of timeouts, I
> now agree that it is better to remove per-test timeouts because it helps
> make our testing results more reliable and consistent.
>
> My previous concern was that it might not be a good idea to intentionally
> let a test hang on AZP in order to get the thread dump. Now I get that
> there are a few practical concerns around the usage of timeouts which
> make testing results unreliable (e.g. flakiness in the presence of VM
> migration).
>
> Regarding the logging level on AZP, it appears that we actually set
> "rootLogger.level = OFF" in most log4j2-test.properties files, which
> means that no INFO logs would be printed on AZP. For example, I tried to
> increase the log level in this PR
> <https://github.com/apache/flink/pull/15617> and was advised in this
> comment
> <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055>
> to avoid increasing the log level. Did I miss something here?
>
> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
>
> > Just to add to Dong Lin's list of cons of allowing timeouts:
> >
> > - Any timeout value that you set manually is arbitrary. If it's set
> > too low, you get test instabilities. What "too low" means depends on
> > numerous factors, such as hardware and current utilization (especially
> > I/O). If you run in VMs and the VM is migrated while running a test,
> > any reasonable timeout will probably fail. While you could make a
> > similar case for the overall timeout of the test run, any smaller
> > hiccup in the range of minutes will not impact the overall runtime
> > much. The probability of a VM constantly migrating during the same
> > stage is abysmally low.
> > - A timeout is more maintenance-intensive. It's one more knob that can
> > make a build pass or fail. If you change a test a bit, you also need
> > to double-check the timeout. Hence, there have been quite a few
> > commits that just increase timeouts.
> > - Whether a test uses a timeout or not is arbitrary: why do some ITs
> > have a timeout while others don't? All IT cases are prone to time out
> > if there are issues with resource allocation. Similarly, quite a few
> > unit tests have timeouts while others don't, with no obvious pattern.
> > - An ill-set timeout reduces build reproducibility. Imagine having a
> > release with such a timeout and users not being able to build Flink
> > reliably.
> >
> > I'd also like to point out that we should not cater to unstable tests
> > if our overall goal is to have as many green builds as possible. If we
> > assume that our builds fail more often than not, we should also look
> > in the other direction and continue the builds on error. I'm not a big
> > fan of that.
> >
> > One argument that I have also heard is that timeouts ease local
> > debugging in case of refactorings, as you can see multiple failures at
> > the same time. But no one is keeping you from temporarily adding a
> > timeout on your branch. Then we can be sure that the timeout is
> > plausible for your hardware and avoid all the drawbacks mentioned
> > above.
> >
> > @Robert Metzger <rmetz...@apache.org>
> >
> > > If we had a global limit of 1 minute per test, we would have caught
> > > this case (and we would encourage people to be careful with CI
> > > time).
> >
> > There are quite a few tests that run longer, especially on a well
> > utilized build machine. A global limit is even worse than individual
> > limits, as there is no value that fits all tests. If you screwed up
> > and 200 tests hang, you'd also run into the global timeout anyway. I'm
> > also not sure what these additional hangs bring you except a huge log.
> >
> > I'm also not sure if it's really better in terms of CI time.
> > For example, for UnalignedCheckpointRescaleITCase, we test all known
> > partitioners in one pipeline for correctness. For higher parallelism,
> > that means the test regularly runs over 1 minute. If there were a
> > global limit, I'd need to split the test into smaller chunks, and I'm
> > positive that the sum of the chunks would take longer than before.
> >
> > PS: all tests on AZP print INFO logs into the artifacts. There you can
> > also retrieve the stacktraces.
> > PPS: I also said that we should revalidate the current timeout on AZP.
> > So the argument that >2h of precious CI time is wasted is kind of
> > constructed and just due to some random defaults.
> >
> > On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > I think we do capture the INFO logs of the test runs on AZP.
> > >
> > > I am also not sure whether we really caught slow tests with JUnit's
> > > timeout rule before. I think the default is usually to increase the
> > > timeout to make the test pass. One way to find slow tests is to
> > > measure the time and look at the outliers.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com>
> > > wrote:
> > >
> > > > There is one more point that may be useful to consider here.
> > > >
> > > > In order to debug a deadlock that is not easily reproducible, it
> > > > is likely not sufficient to see only the thread dump to figure out
> > > > the root cause. We likely need to enable INFO level logging. Since
> > > > AZP does not provide INFO level logging by default, we either need
> > > > to reproduce the bug locally or change the AZP log4j configuration
> > > > temporarily. This further reduces the benefit of logging the
> > > > thread dump (which comes at the cost of letting the AZP job hang).
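For context on what such a dump contains: the information a watchdog-triggered thread dump captures can also be produced from inside a running JVM with the JDK's ThreadMXBean. A minimal, self-contained sketch (the class and method names are mine, this is not Flink's watchdog code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDumpSketch {

    // Collects the state and stack trace of every live thread, including
    // which locks are held -- the starting point for diagnosing a deadlock.
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dump = dumpAllThreads();
        // Every dump contains at least the main thread.
        System.out.println(dump.contains("\"main\"")); // prints "true"
    }
}
```

Note that ThreadInfo.toString() truncates each stack trace to a handful of frames; an external `jstack`, or ThreadMXBean's findDeadlockedThreads(), gives the fuller picture.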
> > > >
> > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com>
> > > > wrote:
> > > >
> > > > > Just to make sure I understand the proposal correctly: is the
> > > > > proposal to disallow the usage of @Test(timeout=...) for Flink
> > > > > JUnit tests?
> > > > >
> > > > > Here is my understanding of the pros/cons according to the
> > > > > discussion so far.
> > > > >
> > > > > Pros of allowing timeouts:
> > > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > > catch those tests and thus increases the quality of unit tests.
> > > > > 2) When there are tests that cause a deadlock, it helps the AZP
> > > > > job fail fast instead of being blocked for 4 hours. This saves
> > > > > resources and also allows developers to get their PR tested
> > > > > again earlier (useful when the test failure is not relevant to
> > > > > their PR).
> > > > >
> > > > > Cons of allowing timeouts:
> > > > > 1) When there are tests that cause a deadlock, we cannot see the
> > > > > thread dump of all threads, which makes debugging the issue
> > > > > harder.
> > > > >
> > > > > I would suggest that we still allow timeouts because the pros
> > > > > outweigh the cons.
> > > > >
> > > > > As far as I can tell, if we allow timeouts and encounter a
> > > > > deadlock bug on AZP, we still know which test (or test suite)
> > > > > fails. There is a good chance we can reproduce the deadlock
> > > > > locally (by running it 100 times) and get the debug information
> > > > > we need. In the rare case where the deadlock happens only on
> > > > > AZP, we can just disable the timeout for that particular test.
> > > > > So the lack of a thread dump is not really a concern.
> > > > >
> > > > > On the other hand, if we disallow timeouts, it will be very hard
> > > > > for us to catch low-quality tests. I don't know if there is a
> > > > > good alternative way to catch those tests.
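To keep the object of the discussion concrete: a per-test timeout like @Test(timeout=...) essentially runs the test body under a hard deadline and interrupts it on expiry, by which point the hanging threads are gone, so no useful dump survives. A rough stdlib sketch of that mechanism (names are mine; this is not JUnit's actual implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch {

    // Runs a task under a hard deadline, roughly what a per-test timeout
    // does: on expiry the task is interrupted and the test fails, with no
    // thread dump taken first.
    static void runWithTimeout(Runnable task, long millis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = pool.submit(task);
            try {
                f.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the hanging task
                throw new AssertionError("test timed out after " + millis + " ms");
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Passes comfortably on an idle machine...
        runWithTimeout(() -> { }, 1000);
        // ...but a task that outlives its budget is killed, which is all
        // the information a timed-out test leaves behind.
        try {
            runWithTimeout(() -> {
                try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
            }, 100);
        } catch (AssertionError expected) {
            System.out.println("timed out as expected");
        }
    }
}
```

The arbitrariness Arvid describes is visible here: the 100 ms budget fails deterministically, but any value chosen to pass on one machine can still fail on a loaded CI worker.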
> > > > >
> > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz
> > > > > <dwysakow...@apache.org> wrote:
> > > > >
> > > > >> Hi devs!
> > > > >>
> > > > >> I wanted to bring up something that was discussed in a few
> > > > >> independent groups of people in the past days. I'd like to
> > > > >> revise using timeouts in our JUnit tests. The suggestion would
> > > > >> be not to use them anymore. The problem with timeouts is that
> > > > >> we get no thread dump and stack traces of the system as it
> > > > >> hangs. If we were not using a timeout, the CI runner would
> > > > >> catch the hanging test and create a thread dump, which is often
> > > > >> a great starting point for debugging.
> > > > >>
> > > > >> This problem was spotted e.g. while debugging FLINK-22416 [1].
> > > > >> In the past, thread dumps were not always taken for hanging
> > > > >> tests, but this was changed quite recently in FLINK-21346 [2].
> > > > >> I am happy to hear your opinions on it. If there are no
> > > > >> objections, I would like to add the suggestion to the Coding
> > > > >> Guidelines [3].
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Dawid
> > > > >>
> > > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > > >>
> > > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > > >>
> > > > >> [3]
> > > > >> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
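The "reproduce it locally by running it 100 times" approach mentioned earlier in the thread can be sketched as a small harness. The scenario body below is hypothetical and stands in for the suspect test logic:

```java
class RepeatUntilFailure {

    // Stand-in for the flaky test body; in practice this would invoke the
    // suspect logic (e.g. start the threads that may deadlock).
    interface Scenario {
        void run(int attempt) throws Exception;
    }

    // Runs the scenario up to maxRuns times; returns the attempt on which
    // it first failed, or -1 if it never failed.
    static int repeatUntilFailure(Scenario scenario, int maxRuns) {
        for (int i = 1; i <= maxRuns; i++) {
            try {
                scenario.run(i);
            } catch (Throwable t) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A hypothetical scenario that fails on the 42nd attempt.
        int attempt = repeatUntilFailure(i -> {
            if (i == 42) {
                throw new IllegalStateException("reproduced");
            }
        }, 100);
        System.out.println("first failure on attempt " + attempt);
        // prints "first failure on attempt 42"
    }
}
```

Once the failure reproduces locally, a debugger or `jstack` on the hung JVM provides the thread dump that a timed-out CI run would have swallowed.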