What exactly is the issue? I've been working on Spark dev for a long time
and very rarely do I actually run into an issue that only manifests on
Jenkins but not locally. I don't have some magic local setup either.

We should definitely cut down test flakiness.


On Thu, Feb 16, 2017 at 5:26 PM, Saikat Kanjilal <sxk1...@hotmail.com>
wrote:

> I'd just like to follow up again on this thread: should we devote some
> energy to fixing unit tests module by module? There wasn't much interest in
> this last time, but given the nature of this thread I'd be willing to deep
> dive into this again with some help.
> ------------------------------
> *From:* Saikat Kanjilal <sxk1...@hotmail.com>
> *Sent:* Wednesday, February 15, 2017 6:12 PM
> *To:* Josh Rosen
> *Cc:* Armin Braun; Kay Ousterhout; dev@spark.apache.org
>
> *Subject:* Re: File JIRAs for all flaky test failures
>
> The issue was not a lack of tooling. I used the URL you are
> describing below to drill down to the exact test failure/stack trace. The
> problem was that my builds would work like a charm locally but fail with
> these errors on Jenkins; that was the whole challenge in fixing the unit
> tests. It was rare (if ever) that I was able to replicate test
> failures locally.
>
> Sent from my iPhone
>
> On Feb 15, 2017, at 5:40 PM, Josh Rosen <joshro...@databricks.com> wrote:
>
> A useful tool for investigating test flakiness is my Jenkins Test Explorer
> service, running at https://spark-tests.appspot.com/
>
> This has some useful timeline views for debugging flaky builds. For
> instance, at https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6
> (may be slow to load) you can see this chart:
> https://i.imgur.com/j8LV3pX.png. Here, each column represents a test run
> and each row represents a test which failed at least once over the
> displayed time period.
>
> In that linked example screenshot you'll notice that a few columns have
> grey squares indicating that tests were skipped but lack any red squares to
> indicate test failures. This usually indicates that the build failed due to
> a problem other than an individual test failure. For example, I clicked
> into one of those builds and found that one test suite failed in test setup
> because the previous suite had not properly cleaned up its SparkContext
> (I'll file a JIRA for this).
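>
> (As an aside: the usual guard against that kind of leak is stopping the
> SparkContext in afterAll even when a test aborts. Here is a rough sketch
> using plain ScalaTest; Spark's own test code has a LocalSparkContext trait
> for this, so treat the names below as illustrative:
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.scalatest.{BeforeAndAfterAll, FunSuite}
>
>     class WordCountSuite extends FunSuite with BeforeAndAfterAll {
>       private var sc: SparkContext = _
>
>       override def beforeAll(): Unit = {
>         super.beforeAll()
>         sc = new SparkContext(
>           new SparkConf().setMaster("local[2]").setAppName("WordCountSuite"))
>       }
>
>       override def afterAll(): Unit = {
>         try {
>           if (sc != null) sc.stop()  // stop even if a test aborted mid-run
>           sc = null
>         } finally {
>           super.afterAll()
>         }
>       }
>
>       test("countByValue") {
>         assert(sc.parallelize(Seq("a", "b", "a")).countByValue()("a") === 2L)
>       }
>     }
>
> Leaking the context instead is exactly what makes the next suite fail in
> setup, since only one SparkContext may be running per JVM.)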
>
> You can click through the interface to drill down to reports on individual
> builds, tests, suites, etc. As an example of an individual test's detail
> page, https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
> shows the patterns of flakiness in an RDD local checkpoint test.
>
> Finally, there's an experimental "interesting new test failures" report
> which tries to surface tests which have started failing very recently:
> https://spark-tests.appspot.com/failed-tests/new. Specifically, entries
> in this feed are test failures which a) occurred in the last week, b) were
> not part of a build which had 20 or more failed tests, c) were not
> observed to fail during the previous week (i.e. no failures from [2
> weeks ago, 1 week ago)), and d) represent the first time that the
> test failed this week (i.e. a test case will appear at most once in the
> results list). I've also exposed this as an RSS feed at
> https://spark-tests.appspot.com/rss/failed-tests/new.
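>
> As a rough sketch of that filter (illustrative names and types only; this
> is not the service's actual code):
>
>     import java.time.Instant
>     import java.time.temporal.ChronoUnit
>
>     // Illustrative schema for a single observed test failure.
>     case class Failure(testName: String, buildId: String, time: Instant)
>
>     def interestingNewFailures(all: Seq[Failure], now: Instant): Seq[Failure] = {
>       val weekAgo = now.minus(7, ChronoUnit.DAYS)
>       val twoWeeksAgo = now.minus(14, ChronoUnit.DAYS)
>       val failuresPerBuild = all.groupBy(_.buildId).mapValues(_.size)
>       // tests that failed in [2 weeks ago, 1 week ago)
>       val failedPrevWeek = all
>         .filter(f => f.time.isAfter(twoWeeksAgo) && !f.time.isAfter(weekAgo))
>         .map(_.testName).toSet
>       all.filter(_.time.isAfter(weekAgo))               // (a) last week only
>         .filter(f => failuresPerBuild(f.buildId) < 20)  // (b) skip mass-failure builds
>         .filter(f => !failedPrevWeek(f.testName))       // (c) not seen the week before
>         .groupBy(_.testName).values
>         .map(_.minBy(_.time.toEpochMilli))              // (d) first failure per test
>         .toSeq
>     }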
>
>
> On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
> I would recommend we just open JIRAs for unit tests by module
> (core/ml/sql, etc.) and fix them one module at a time; this at least keeps
> the number of unit tests needing fixes down to a manageable number.
>
>
> ------------------------------
> *From:* Armin Braun <m...@obrown.io>
> *Sent:* Wednesday, February 15, 2017 12:48 PM
> *To:* Saikat Kanjilal
> *Cc:* Kay Ousterhout; dev@spark.apache.org
> *Subject:* Re: File JIRAs for all flaky test failures
>
> I think one thing contributing to this a lot is the general issue of the
> tests taking up a lot of file descriptors (10k+ if I run them on a
> standard Debian machine).
> There are a few suites that contribute to this in particular, like
> `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few
> others, appears to consume a lot of fds.
>
> Wouldn't it make sense to open JIRAs about those and actively try to
> reduce the resource consumption of these tests?
> It seems to me these can cause a lot of unpredictable behavior (making the
> reason for flaky tests hard to identify, especially when there are timeouts
> etc. involved), and they make it prohibitively expensive for many people to
> test locally, imo.
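>
> FWIW, one low-tech way to see which suites leak is to log the JVM's open-fd
> count before and after a suspect suite. A rough sketch (Unix-only, and it
> assumes the com.sun.management bean is available):
>
>     import java.lang.management.ManagementFactory
>     import com.sun.management.UnixOperatingSystemMXBean
>
>     def openFds(): Long = ManagementFactory.getOperatingSystemMXBean match {
>       case os: UnixOperatingSystemMXBean => os.getOpenFileDescriptorCount
>       case _ => -1L  // bean not exposed on this JVM/platform
>     }
>
>     val before = openFds()
>     // ... run the suspect suite here ...
>     println(s"open fds: before=$before, after=${openFds()}")
>
> Wrapping ExecutorAllocationManagerSuite like this would show whether the
> fds are held during the run or leaked afterwards.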
>
> On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
> I was working on something to address this a while ago
> (https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty of
> testing locally made things a lot more complicated to fix for each of the
> unit tests. Should we resurface this JIRA? I would wholeheartedly
> agree with the flakiness assessment of the unit tests.
> [SPARK-9487] Use the same num. worker threads in Scala ...
> <https://issues.apache.org/jira/browse/SPARK-9487>
> "In Python we use `local[4]` for unit tests, while in Scala/Java we use
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other
> components. If the ..."
>
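> The gist of that JIRA is to pin every suite to a single master string. A
> hypothetical helper, purely to illustrate (this is not Spark's actual test
> harness):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     object TestContexts {
>       // One shared value so all suites agree on test parallelism.
>       val TestMaster = "local[4]"
>
>       def create(appName: String): SparkContext =
>         new SparkContext(
>           new SparkConf().setMaster(TestMaster).setAppName(appName))
>     }
>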
> ------------------------------
> *From:* Kay Ousterhout <kayousterh...@gmail.com>
> *Sent:* Wednesday, February 15, 2017 12:10 PM
> *To:* dev@spark.apache.org
> *Subject:* File JIRAs for all flaky test failures
>
> Hi all,
>
> I've noticed the Spark tests getting increasingly flaky -- it seems more
> common than not now that the tests need to be re-run at least once on PRs
> before they pass.  This is both annoying and problematic because it makes
> it harder to tell when a PR is introducing new flakiness.
>
> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
> fails on a PR (for a reason unrelated to the PR).  Just provide a quick
> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
> "Tests failed because 250m timeout expired", a link to the failed build,
> and include the "Tests" component.  If there's already a JIRA for the
> issue, just comment with a link to the latest failure.  I know folks don't
> always have time to track down why a test failed, but this is at least
> helpful to someone else who, later on, is trying to diagnose when the issue
> started and to find the problematic code / test.
>
> If this seems like too high overhead, feel free to suggest alternative
> ways to make the tests less flaky!
>
> -Kay
>