I would recommend we just open JIRAs for unit tests by module 
(core/ml/sql etc.) and fix them one module at a time. This at least keeps the 
number of unit tests needing fixes at any one time manageable.


________________________________
From: Armin Braun <m...@obrown.io>
Sent: Wednesday, February 15, 2017 12:48 PM
To: Saikat Kanjilal
Cc: Kay Ousterhout; dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

I think another big contributor to this is that the tests take up a lot of 
file descriptors (10k+ if I run them on a standard Debian machine).
There are a few suites that contribute to this in particular, like 
`org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others, 
appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce the 
resource consumption of these tests?
It seems to me these can cause a lot of unpredictable behavior (making the 
reason for flaky tests hard to identify, especially when there are timeouts 
etc. involved), and they make it prohibitively expensive for many to test 
locally, imo.
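
For anyone who wants to check the descriptor usage described above on their own machine, here is a minimal sketch. It assumes a Linux-style /proc filesystem (falling back to /dev/fd for the current process on macOS/BSD); the helper names `count_open_fds` and `fds_opened_by` are made up for illustration and are not part of Spark or its build.

```python
import os


def count_open_fds(pid=None):
    """Return the number of file descriptors currently open in a process.

    Assumption: /proc/<pid>/fd exists (Linux). For the current process we
    fall back to /dev/fd when /proc is not available. The count includes
    the descriptor briefly held while listing the directory itself.
    Reading another process's fd directory needs matching permissions.
    """
    if pid is None:
        fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    else:
        fd_dir = "/proc/%d/fd" % pid
    return len(os.listdir(fd_dir))


def fds_opened_by(action):
    """Run `action` and report how many descriptors it left open."""
    before = count_open_fds()
    result = action()
    after = count_open_fds()
    return after - before, result
```

Calling `count_open_fds(pid)` with the pid of a running test JVM before and after a given suite would give a rough per-suite number to put in a JIRA, though that only works if the suites run in one long-lived process.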

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal 
<sxk1...@hotmail.com> wrote:

I was working on something to address this a while ago 
(https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty of 
testing locally made it a lot more complicated to fix each of the unit tests. 
Should we resurface this JIRA? I wholeheartedly agree with the flakiness 
assessment of the unit tests.

[SPARK-9487] Use the same num. worker threads in Scala ...
<https://issues.apache.org/jira/browse/SPARK-9487>
In Python we use `local[4]` for unit tests, while in Scala/Java we use 
`local[2]` and `local` for some unit tests in SQL, MLLib, and other components. 
If the ...




________________________________
From: Kay Ousterhout <kayousterh...@gmail.com>
Sent: Wednesday, February 15, 2017 12:10 PM
To: dev@spark.apache.org
Subject: File JIRAs for all flaky test failures

Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more common 
than not now that the tests need to be re-run at least once on PRs before they 
pass.  This is both annoying and problematic because it makes it harder to tell 
when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins fails 
on a PR (for a reason unrelated to the PR).  Just provide a quick description 
of the failure -- e.g., "Flaky test: DAGSchedulerSuite" or "Tests failed 
because 250m timeout expired", a link to the failed build, and include the 
"Tests" component.  If there's already a JIRA for the issue, just comment with 
a link to the latest failure.  I know folks don't always have time to track 
down why a test failed, but this is at least helpful to someone else who, later 
on, is trying to diagnose when the issue started to find the problematic code / 
test.

If this seems like too high overhead, feel free to suggest alternative ways to 
make the tests less flaky!

-Kay
