Thank you Akira and Wei-Chiu.
IMHO, the issue is about more than just flaky tests. It has more depth:
- Every developer should stay committed to keeping the code healthy.
- Those flaky tests are actually "*bugs*" that need to be fixed. It is
  evident that there is a major problem in how we handle resources, as I
  will explain below.

> 1. Other projects such as HBase have a tool to exclude flaky tests from
> being executed. They track flaky tests and display them in a dashboard.
> This will allow good tests to pass while leaving time for folks to fix
> them. Or we could manually exclude tests (this is what we used to do at
> Cloudera)
>

I like the idea of having a tool that gives a view of broken tests.

I spent a long time converting HDFS flaky tests into sub-tasks under
HDFS-15646 <https://issues.apache.org/jira/browse/HDFS-15646>, and I
believe there are still tons on the loose.
I also remember exploring a tool called DeFlaker
<https://www.jonbell.net/icse18-deflaker.pdf>, which detects flaky tests
and then reruns them to verify whether they still pass.
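
To make the rerun idea concrete, here is a minimal sketch (my own
illustration, not DeFlaker's implementation, and assuming JUnit 4) of a
rule that reruns a failing test once and flags it as flaky when the rerun
passes:

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

/**
 * Illustrative sketch only (not DeFlaker): rerun a failing test once.
 * A test that fails and then passes on the rerun is non-deterministic,
 * so report it as flaky instead of letting it slip through.
 */
public class FlakyDetectorRule implements TestRule {
  @Override
  public Statement apply(Statement base, Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        try {
          base.evaluate();          // first run
        } catch (Throwable firstFailure) {
          base.evaluate();          // rerun; throws again if genuinely broken
          throw new AssertionError( // rerun passed => likely flaky
              "Flaky test detected: " + description.getDisplayName(),
              firstFailure);
        }
      }
    };
  }
}

A test class would opt in with
"@Rule public FlakyDetectorRule flaky = new FlakyDetectorRule();", and the
failures it raises could feed the kind of dashboard mentioned above.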

I do not think we necessarily want to exclude the flaky tests, but at least
they should be enumerated and addressed regularly because they are, after
all, "bugs". Having just a few flaky tests bring everything down indicates
that there is a major problem with how resources are handled.
I pointed out this issue in YARN-10334
<https://issues.apache.org/jira/browse/YARN-10334>, where I found that
TestDistributedShell is nothing but a black hole that sucks up all the
resources (memory/CPU/ports).
As another example, I ran a few unit tests on my local machine. In less
than an hour, I found six java processes still listening on ports.
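
A cheap guard against exactly this kind of leak could be a teardown check
like the sketch below (the class name and port number are made up for
illustration, not taken from Hadoop): if the port the test was supposed to
release is still bound, fail loudly rather than letting the next test that
needs it die mysteriously.

import java.io.IOException;
import java.net.ServerSocket;

import org.junit.AfterClass;

/**
 * Hypothetical teardown check: verify the port used by the test suite was
 * actually released. TEST_PORT is an assumed constant for illustration.
 */
public class PortLeakCheck {
  private static final int TEST_PORT = 8088; // assumption, not a real fixture

  @AfterClass
  public static void assertPortReleased() {
    try (ServerSocket probe = new ServerSocket(TEST_PORT)) {
      // Binding succeeded, so the port was released as expected.
    } catch (IOException e) {
      throw new AssertionError(
          "Port " + TEST_PORT + " is still in use after the tests", e);
    }
  }
}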

The point is that flaky tests should not be *downplayed* for such a long
time, as they could be indicators of a serious bug. In the current
situation, we should find out what is eating all those resources.
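
As a starting point for that hunt, something as simple as the following
sketch (plain Java 9+ ProcessHandle API, run after a test batch finishes)
would list the java processes that survived the run; filtering on the
command name is just a heuristic:

import java.util.stream.Collectors;

/** List java processes still alive after a test run (illustrative sketch). */
public class LeftoverJavaProcesses {
  public static void main(String[] args) {
    long self = ProcessHandle.current().pid();
    String leftovers = ProcessHandle.allProcesses()
        .filter(p -> p.pid() != self)                                // skip this JVM
        .filter(p -> p.info().command().orElse("").contains("java")) // heuristic match
        .map(p -> p.pid() + "  " + p.info().commandLine().orElse("<unknown>"))
        .collect(Collectors.joining(System.lineSeparator()));
    System.out.println("Surviving java processes:"
        + System.lineSeparator() + leftovers);
  }
}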

> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> day two years ago, and maybe it's time to repeat it again:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>  this
> is going to be tricky as we are in a pandemic and most of the community are
> working from home, unlike the last time when we can lock ourselves in a
> conference room and force everybody to work :)
>
This actually sounds fun and I like it, but I doubt it is feasible in
practice :)

> I also wondered if the hardware was too stressed since all Hadoop related
> projects all use the same set of Jenkins servers.
> However, HBase just recently moved to their own dedicated machines, so I'm
> actually surprised to see a lot of resource related failures even now.
>
As I mentioned in my response to the first point, a black hole is created
once the tests are triggered.
I could not even run TestDistributedShell on my local machine. The tests
run out of everything after the first 11 unit tests, and it takes only one
unit test failing to break the rest.
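
One way to keep a single failure from poisoning the rest of the suite is
the usual cleanup discipline, sketched below (the class and field names are
placeholders, not the actual TestDistributedShell code): stop the mini
cluster in an @After method that runs even when the test body fails, so one
failure cannot leak memory, ports and child processes into every test after
it.

import org.apache.hadoop.yarn.server.MiniYARNCluster;
import org.junit.After;

/**
 * Sketch of the cleanup pattern (placeholder names): tear the mini cluster
 * down after every test, pass or fail, so a single failure cannot starve
 * the rest of the suite of memory, ports and processes.
 */
public abstract class DistributedShellTestBase {
  protected MiniYARNCluster yarnCluster;

  @After
  public void tearDownCluster() {
    if (yarnCluster != null) {
      try {
        yarnCluster.stop();   // releases the RM/NM daemons and their ports
      } finally {
        yarnCluster = null;   // let the heap the cluster held be collected
      }
    }
  }
}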

On Thu, Oct 22, 2020 at 5:28 PM Wei-Chiu Chuang <weic...@apache.org> wrote:

> I also wondered if the hardware was too stressed since all Hadoop related
> projects all use the same set of Jenkins servers.
> However, HBase just recently moved to their own dedicated machines, so I'm
> actually surprised to see a lot of resource related failures even now.
>
> On Thu, Oct 22, 2020 at 2:03 PM Wei-Chiu Chuang <weic...@apache.org>
> wrote:
>
> > Thanks for raising the issue, Akira and Ahmed,
> >
> > Fixing flaky tests is a thankless job so I want to take this opportunity
> > to recognize the time and effort.
> >
> > We will always have flaky tests due to bad tests or simply infra issues.
> > Fixing flaky tests will take time but if they are not addressed it wastes
> > everybody's time.
> >
> > Recognizing this problem, I have two suggestions:
> >
> > 1. Other projects such as HBase have a tool to exclude flaky tests from
> > being executed. They track flaky tests and display them in a dashboard.
> > This will allow good tests to pass while leaving time for folks to fix
> > them. Or we could manually exclude tests (this is what we used to do at
> > Cloudera)
> >
> > 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> > day two years ago, and maybe it's time to repeat it again:
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
> this
> > is going to be tricky as we are in a pandemic and most of the community
> are
> > working from home, unlike the last time when we can lock ourselves in a
> > conference room and force everybody to work :)
> >
> > Thoughts?
> >
> >
> > On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aajis...@apache.org>
> > wrote:
> >
> >> Hi Hadoop developers,
> >>
> >> Now there are a lot of failing unit tests and there is an issue to
> >> tackle this bad situation.
> >> https://issues.apache.org/jira/browse/HDFS-15646
> >>
> >> Although this issue is in HDFS project, this issue is related to all
> >> the Hadoop developers. Please check the above URL, read the
> >> description, and volunteer to dedicate more time to fix flaky tests.
> >> Your contribution to fixing the flaky tests will be really
> >> appreciated!
> >>
> >> Thank you Ahmed Hussein for your report.
> >>
> >> Regards,
> >> Akira
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
> >> For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
> >>
> >>
>


-- 
Best Regards,

*Ahmed Hussein, PhD*
