[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-24 Thread Pam Greene
On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:

 On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:


 1) We don't have notes on why tests are failing.  =>  Why not annotate
 the tests in test_lists?  That's what I've always done.


 Once again, we don't want to add more state to the test_expectations.
  How many people looked up the tests they were supposed to rebaseline in this
 file to see if there were notes?  I kind of doubt anyone.


 Um... this makes no sense to me.  You can't rebaseline a test without
 modifying test_expectations.  In modifying it, you *have* to look at it.
  It's pretty difficult to miss comments above tests as you're trying to
 write REBASELINE or delete the line.

 If you somehow managed to not see any comments in this file, I think
 you're an outlier.


 I was talking about the rebaselining teams, not the act of actually
 rebaselining.  If someone's rebaselining a test, then it means we now
 believe it's passing.  At that point, the notes are not very interesting,
 right?  Are you saying that you looked at all the tests' notes before you
 looked through the results to determine if they should be rebaselined?


We're trying to leave all comments in the bugs now, rather than in the
test_expectations file, so there's only one point of contact. We used to
leave extensive comments in the file, but they always grew stale. And yes, I
looked at the bug for every test that I thought was correct, usually to
write "tests A, B and C are still bad, but D was actually correct and is
being re-baselined."





 There are different reasons for failing.  A layout test could be failing
 because of a known bug and then start failing in a different way (later) due
 to a regression.  When a bug fails in a new way, it's worth taking a quick
 look, I think.


 Why?  Unless the earlier failure has been fixed we can't rebaseline the
 test.  (I ran into a number of tests like this when doing my rebaselining
 pass.)  What is the point of looking again?


 In case the new failure is more serious than the earlier one.


True. But I don't think this will happen often, and I'd rather devote the
time to fixing the tests.

- Pam




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-24 Thread Ojan Vafai
On Mon, Aug 24, 2009 at 10:37 AM, Pam Greene p...@chromium.org wrote:

 On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:

 On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:

  There are different reasons for failing.  A layout test could be failing
 because of a known bug and then start failing in a different way (later) 
 due
 to a regression.  When a bug fails in a new way, it's worth taking a quick
 look, I think.


 Why?  Unless the earlier failure has been fixed we can't rebaseline the
 test.  (I ran into a number of tests like this when doing my rebaselining
 pass.)  What is the point of looking again?


 In case the new failure is more serious than the earlier one.


 True. But I don't think this will happen often, and I'd rather devote the
 time to fixing the tests.


The end goal is to be in a state where we have near zero failing tests that
are not for unimplemented features. And new failures from the merge get
addressed within a week.

Once we're at that point, would this new infrastructure be useful? I
completely support infrastructure that sustainably supports us being at near
zero failing tests (e.g. the rebaseline tool). All infrastructure/process
has a maintenance cost though.

Ojan




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-24 Thread Dirk Pranke

On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:
 The end goal is to be in a state where we have near zero failing tests that
 are not for unimplemented features. And new failures from the merge get
 addressed within a week.
 Once we're at that point, would this new infrastructure be useful? I
 completely support infrastructure that sustainably supports us being at near
 zero failing tests (e.g. the rebaseline tool). All infrastructure/process
 has a maintenance cost though.

True enough. There are at least two counterexamples that are worth
considering. The first is that we probably won't be at zero failing tests
any time soon (where "any time soon" == next 3-6 months), and so there
may be value in the interim. The second is that we have a policy of
running every test, even tests for unimplemented features, and so we
may catch regressions for the foreseeable future.

That said, I don't know if the value will offset the cost. Hence the
desire to run a couple of cheap experiments :)

-- Dirk




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-24 Thread Dirk Pranke

On Mon, Aug 24, 2009 at 1:52 PM, David Levin le...@google.com wrote:


 On Mon, Aug 24, 2009 at 1:37 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:
  The end goal is to be in a state where we have near zero failing tests
  that
  are not for unimplemented features. And new failures from the merge get
  addressed within a week.
  Once we're at that point, would this new infrastructure be useful? I
  completely support infrastructure that sustainably supports us being at
  near
  zero failing tests (e.g. the rebaseline tool). All
  infrastructure/process
  has a maintenance cost though.

 True enough. There are at least two counterexamples that are worth
 considering. The first is that we probably won't be at zero failing tests
 any time soon (where "any time soon" == next 3-6 months), and so there
 may be value in the interim. The second is that we have a policy of
 running every test, even tests for unimplemented features, and so we
 may catch regressions for the foreseeable future.

 That said, I don't know if the value will offset the cost. Hence the
 desire to run a couple of cheap experiments :)

 What do the cheap experiments entail?  Key concern: If the cheapness is to
 put more work on the webkit gardeners, it isn't cheap at all imo.


Cheap experiments == me snapshotting the results of tests I run
periodically and comparing them. No work for anyone else.
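
For concreteness, a snapshot-and-diff pass along these lines would do the job
(a sketch only; the directory names here are made up and this is not the
actual tooling):

  import filecmp
  import os
  import shutil
  import time

  RESULTS_DIR = 'layout-test-results'    # wherever the run writes actual output (assumed)
  SNAPSHOT_ROOT = 'failure-snapshots'

  def take_snapshot():
      # Copy the current failing-test output aside, stamped with the time.
      dest = os.path.join(SNAPSHOT_ROOT, time.strftime('%Y%m%d-%H%M%S'))
      shutil.copytree(RESULTS_DIR, dest)
      return dest

  def _changed(dc):
      # Collect files whose contents differ, recursing into subdirectories.
      changed = [os.path.join(dc.right, name) for name in dc.diff_files]
      for sub in dc.subdirs.values():
          changed.extend(_changed(sub))
      return changed

  def changed_files(old_snapshot, new_snapshot):
      # Which failing tests produced different output between two runs?
      return _changed(filecmp.dircmp(old_snapshot, new_snapshot))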

-- Dirk




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-24 Thread David Levin
On Mon, Aug 24, 2009 at 1:37 PM, Dirk Pranke dpra...@chromium.org wrote:


 On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:
  The end goal is to be in a state where we have near zero failing tests
 that
  are not for unimplemented features. And new failures from the merge get
  addressed within a week.
  Once we're at that point, would this new infrastructure be useful? I
  completely support infrastructure that sustainably supports us being at
 near
  zero failing tests (e.g. the rebaseline tool). All infrastructure/process
  has a maintenance cost though.

 True enough. There are at least two counterexamples that are worth
 considering. The first is that we probably won't be at zero failing tests
 any time soon (where "any time soon" == next 3-6 months), and so there
 may be value in the interim. The second is that we have a policy of
 running every test, even tests for unimplemented features, and so we
 may catch regressions for the foreseeable future.

 That said, I don't know if the value will offset the cost. Hence the
 desire to run a couple of cheap experiments :)


What do the cheap experiments entail?  Key concern: If the cheapness is to
put more work on the webkit gardeners, it isn't cheap at all imo.





 -- Dirk

 





[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-23 Thread Dimitri Glazkov

On Sat, Aug 22, 2009 at 9:51 PM, Jeremy Orlow jor...@chromium.org wrote:
 It might be worth going through all the LayoutTest bugs and double check
 they're split up into individual root causes (or something approximating
 that).  I'll try to make time to do a scan in the next week or so, but it'd
 be great if anyone else had time to help.  :-)

I've been doing this last week. Maybe we could figure out how to do
this in parallel on Monday?

:DG




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-23 Thread Dimitri Glazkov

I understand the resistance to implementing yet another bit of process
and effort around layout tests. I really do. However, I found some
merit in Dirk's idea -- it allows us to clearly see the impact of a
regression.

Sadly, I can't come up with a specific example at the moment, but let
me pull one out of my ... hat, based on previous experiences. Let's
say we had a regression in JSON parsing. But since we already fail
parts of the LayoutTests/fast/js/JSON-parse.html, we wouldn't notice
it. Especially with DOM bindings, there are tons of tests like this --
we pass only parts of them, so we wouldn't know when our changes or
commits upstream introduce regressions that we really ought to be
noticing.
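
For illustration only (the file names below are made up), a recorded copy of
the known-bad output plus a plain diff would surface exactly that kind of
change:

  import difflib

  def failure_changed(recorded_bad_path, actual_output_path):
      # Compare the failing output we already triaged against the output
      # from the current run.
      with open(recorded_bad_path) as f:
          known_bad = f.readlines()
      with open(actual_output_path) as f:
          actual = f.readlines()
      # A non-empty diff means the test now fails in a *different* way
      # than the failure we already knew about -- worth a human look.
      return list(difflib.unified_diff(known_bad, actual,
                                       fromfile='JSON-parse-bad.txt',
                                       tofile='JSON-parse-actual.txt'))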

It's kind of like marking layout tests as flaky: there's no way to
determine whether the flakiness is gone other than by recording some
extra information.

So to me at least, the benefit of this type of solution is not
near-zero. My only hesitation comes from having to decide whether we
should stop and implement this rather than dedicate all of our
resources to plowing ahead in fixing layout tests and driving the
number to 0 (and thus eliminating the need for this solution).

:DG

On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
 On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org
 wrote:

 On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

 This is all good feedback, thanks! To clarify, though: what do you
 think the cost will be? Perhaps you are assuming things about how I
 would implement this that are different than what I had in mind.

 Some amount of your time, and some amount of space on the bots.

 Also, some amount of the rest of the team's time to follow this process.
 Ojan
 





[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-22 Thread Peter Kasting
On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:


 1) We don't have notes on why tests are failing.  =>  Why not annotate
 the tests in test_lists?  That's what I've always done.


 Once again, we don't want to add more state to the test_expectations.  How
 many people looked up the tests they were supposed to rebaseline in this file
 to see if there were notes?  I kind of doubt anyone.


Um... this makes no sense to me.  You can't rebaseline a test without
modifying test_expectations.  In modifying it, you *have* to look at it.
 It's pretty difficult to miss comments above tests as you're trying to
write REBASELINE or delete the line.

If you somehow managed to not see any comments in this file, I think you're
an outlier.

There are different reasons for failing.  A layout test could be failing
 because of a known bug and then start failing in a different way (later) due
 to a regression.  When a bug fails in a new way, it's worth taking a quick
 look, I think.


Why?  Unless the earlier failure has been fixed we can't rebaseline the
test.  (I ran into a number of tests like this when doing my rebaselining
pass.)  What is the point of looking again?

PK




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-22 Thread Jeremy Orlow
On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:


 1) We don't have notes on why tests are failing.  =>  Why not annotate
 the tests in test_lists?  That's what I've always done.


 Once again, we don't want to add more state to the test_expectations.  How
 many people looked up the tests they were supposed to rebaseline in this file
 to see if there were notes?  I kind of doubt anyone.


 Um... this makes no sense to me.  You can't rebaseline a test without
 modifying test_expectations.  In modifying it, you *have* to look at it.
  It's pretty difficult to miss comments above tests as you're trying to
 write REBASELINE or delete the line.

 If you somehow managed to not see any comments in this file, I think you're
 an outlier.


I was talking about the rebaselining teams, not the act of actually
rebaselining.  If someone's rebaselining a test, then it means we now
believe it's passing.  At that point, the notes are not very interesting,
right?  Are you saying that you looked at all the tests' notes before you
looked through the results to determine if they should be rebaselined?



 There are different reasons for failing.  A layout test could be failing
 because of a known bug and then start failing in a different way (later) due
 to a regression.  When a bug fails in a new way, it's worth taking a quick
 look, I think.


 Why?  Unless the earlier failure has been fixed we can't rebaseline the
 test.  (I ran into a number of tests like this when doing my rebaselining
 pass.)  What is the point of looking again?


In case the new failure is more serious than the earlier one.




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-22 Thread Jeremy Orlow
On Sat, Aug 22, 2009 at 5:54 PM, Peter Kasting pkast...@chromium.org wrote:

 On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:

  On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:

 If you somehow managed to not see any comments in this file, I think
 you're an outlier.


 I was talking about the rebaselining teams, not the act of actually
 rebaselining.  If someone's rebaselining a test, then it means we now
 believe it's passing.  At that point, the notes are not very interesting,
 right?  Are you saying that you looked at all the tests' notes before you
 looked through the results to determine if they should be rebaselined?


 I certainly looked at them during the process of determining what was going
 on, and left several notes of my own.

 I don't think I understand your objection.  Are you saying notes are
 useless or that they're harmful?  I don't think either is true.  If you're
 trying to determine how to fix a layout test, the notes in the file are one
 of the first things you see, because you're looking in the file to find the
 bug #, what OSes are affected, etc.  At that point notes that say what to
 look for are useful. If you're trying to determine whether to rebaseline a
 test, notes are at worst harmless and at best useful in pointing out some
 subtlety that you overlooked if you'd already made your decision.  You HAVE
 to see the notes because you HAVE to edit the file.

 Notes in test_expectations.txt are like comments in source code: A great
 boon.


I've heard differing opinions, but you're definitely the most gung-ho
person I've talked to about notes in the test_expectations.txt file.  Typically
bugs are where most if not all of the information on failures should be
kept.  If there is information in the test_expectations.txt file, it should
certainly be a subset of the information in the bugs, would you not agree?



  There are different reasons for failing.  A layout test could be failing
 because of a known bug and then start failing in a different way (later) 
 due
 to a regression.  When a bug fails in a new way, it's worth taking a quick
 look, I think.


 Why?  Unless the earlier failure has been fixed we can't rebaseline the
 test.  (I ran into a number of tests like this when doing my rebaselining
 pass.)  What is the point of looking again?


 In case the new failure is more serious than the earlier one.


 The only possible reason I could think of that would matter is if we're using
 this as a source of triage input into which bugs we should fix first.  But
 we have so many thousands of bugs, nearly all likely to be higher priority
 than a second failure in a test we already haven't prioritized fixing, that
 I don't consider this a valuable signal.


I suppose that is true.




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-22 Thread Peter Kasting
On Sat, Aug 22, 2009 at 7:49 PM, Jeremy Orlow jor...@chromium.org wrote:

 On Sat, Aug 22, 2009 at 5:54 PM, Peter Kasting pkast...@chromium.org wrote:

 Notes in test_expectations.txt are like comments in source code: A great
 boon.


 I've heard differing opinions, but you're definitely the most gung-ho
 person I've talked to about notes in the test_expectations.txt file.  Typically
 bugs are where most if not all of the information on failures should be
 kept.  If there is information in the test_expectations.txt file, it should
 certainly be a subset of the information in the bugs, would you not agree?


Yes, that is ideal.  One nice thing about comments in the test_expectations
file is that unlike comments in bugs, they're (a) hard to miss and (b)
unlikely to be drowned by a sea of bugdroid comments and other spew.  Also,
frequently tests with completely different failures get grouped into one bug
("merge failures r1-r2") and comments on the tests can help add clarity
(although splitting these into multiple bugs is also advisable).

PK




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Jeremy Orlow
On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:


 Hi all,

 As Glenn noted, we made great progress last week in rebaselining the
 tests. Unfortunately, we don't have a mechanism to preserve the
 knowledge we gained last week as to whether or not tests need to be
 rebaselined or not, and why not. As a result, it's easy to imagine
 that we'd need to repeat this process every few months.

 I've written up a proposal for preventing this from happening again,
 and I think it will also help us notice more regressions in the
 future. Check out:


 http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

 Here's the executive summary from that document:

 We have a lot of layout test failures. For each test failure, we have
 no good way of tracking whether or not someone has looked at the test
 output lately, and whether or not the test output is still broken or
 should be rebaselined. We just went through a week of rebaselining,
 and stand a good chance of needing to do that again in a few months
 and losing all of the knowledge that was captured last week.

 So, I propose a way to capture the current broken output from
 failing tests, and to version control them so that we can tell when a
 test's output changes from one expected failing result to another.
 Such a change may reflect that there has been a regression, or that
 the bug has been fixed and the test should be rebaselined.

 Changes

 - We modify the layout test scripts to check for 'foo-bad' as well as
 'foo-expected'. If the output of test foo does not match
 'foo-expected', then we check to see if it matches 'foo-bad'. If it
 does, then we treat it as we treat test failures today, except that
 there is no need to save the failed test result (since a version of
 the output is already checked in). Note that although -bad is
 similar to a different platform, we cannot actually use a different
 platform, since we actually need up to N different -bad versions,
 one for each supported platform that a test fails on.
 - We check in a set of '*-bad' baselines based on current output from
 the regressions. In theory, they should all be legitimate.
 - We modify the test to also report regressions from the *-bad
 baselines. In the cases where we know the failing test is also flaky
 or nondeterministic, we can indicate that as NDFAIL in test
 expectations to distinguish from a regular deterministic FAIL.
 - We modify the rebaselining tools to handle *-bad output as well as
 *-expected.
 - Just like we require each test failure to be associated with a bug, we
 require each *-bad output to be associated with a bug - normally
 (always?) the same bug. The bug should contain comments about what the
 difference is between the broken output and the expected output, and
 why it's different, e.g., something like "Note that the text is in two
 lines in the -bad output, and it should be all on the same line
 without wrapping."
 - The same approach can be used here to justify platform-specific
 variances in output, if we decide to become even more picky about
 this, but I suggest we learn to walk before we try to run.
 - Eventually (?) we modify the layout test scripts themselves to fail if
 the *-bad baselines aren't matched.

 Let me know what you think. If it's a thumbs-up, I'll probably
 implement this next week. Thanks!
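
For concreteness, the first change above amounts to something like this (a
sketch only; the text-result naming and helper structure are assumptions, not
the actual layout test scripts):

  import filecmp
  import os

  def classify(test_path, actual_output_path):
      # 'foo.html' -> 'foo-expected.txt' / 'foo-bad.txt'; text results only,
      # to keep the sketch short.
      base = os.path.splitext(test_path)[0]
      expected = base + '-expected.txt'
      bad = base + '-bad.txt'
      if os.path.exists(expected) and filecmp.cmp(actual_output_path, expected, shallow=False):
          return 'PASS'
      if os.path.exists(bad) and filecmp.cmp(actual_output_path, bad, shallow=False):
          # Known failure: the output matches the checked-in broken baseline,
          # so there is no need to save the failed result again.
          return 'KNOWN FAIL'
      # Matches neither baseline: either a new regression, or the bug was
      # fixed and the test should be rebaselined.
      return 'UNEXPECTED FAIL'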


I really like this plan.  It seems easy to implement and quite useful.  +1
from me!




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Evan Martin

This seems to me like a lot more work for minimal gain.  Because
you've thought more about it than I have, it makes me think I'm
misunderstanding something.  Can you explain this more simply, in
terms of use cases?

Here's what I think you're saying:
1) We don't have notes on why tests are failing.  =>  Why not annotate
the tests in test_lists?  That's what I've always done.

2) We don't have a way of tracking when a failing test output changes.
 =>  But failing is failing; no matter what, you want a human to look
at the result before you mark it as passing, so it doesn't seem like
it's worth a bunch of extra machinery to track this.  And if a test
starts passing it gets marked "unexpected pass" by the builders
already, and it also seems like a human should look at it.

On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:

 Hi all,

 As Glenn noted, we made great progress last week in rebaselining the
 tests. Unfortunately, we don't have a mechanism to preserve the
 knowledge we gained last week as to whether or not tests need to be
 rebaselined, and why not. As a result, it's easy to imagine
 that we'd need to repeat this process every few months.

 I've written up a proposal for preventing this from happening again,
 and I think it will also help us notice more regressions in the
 future. Check out:

 http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

 Here's the executive summary from that document:

 We have a lot of layout test failures. For each test failure, we have
 no good way of tracking whether or not someone has looked at the test
 output lately, and whether or not the test output is still broken or
 should be rebaselined. We just went through a week of rebaselining,
 and stand a good chance of needing to do that again in a few months
 and losing all of the knowledge that was captured last week.

 So, I propose a way to capture the current broken output from
 failing tests, and to version control them so that we can tell when a
 test's output changes from one expected failing result to another.
 Such a change may reflect that there has been a regression, or that
 the bug has been fixed and the test should be rebaselined.

 Changes

 - We modify the layout test scripts to check for 'foo-bad' as well as
 'foo-expected'. If the output of test foo does not match
 'foo-expected', then we check to see if it matches 'foo-bad'. If it
 does, then we treat it as we treat test failures today, except that
 there is no need to save the failed test result (since a version of
 the output is already checked in). Note that although -bad is
 similar to a different platform, we cannot actually use a different
 platform, since we actually need up to N different -bad versions,
 one for each supported platform that a test fails on.
 - We check in a set of '*-bad' baselines based on current output from
 the regressions. In theory, they should all be legitimate.
 - We modify the test to also report regressions from the *-bad
 baselines. In the cases where we know the failing test is also flaky
 or nondeterministic, we can indicate that as NDFAIL in test
 expectations to distinguish from a regular deterministic FAIL.
 - We modify the rebaselining tools to handle *-bad output as well as
 *-expected.
 - Just like we require each test failure to be associated with a bug, we
 require each *-bad output to be associated with a bug - normally
 (always?) the same bug. The bug should contain comments about what the
 difference is between the broken output and the expected output, and
 why it's different, e.g., something like "Note that the text is in two
 lines in the -bad output, and it should be all on the same line
 without wrapping."
 - The same approach can be used here to justify platform-specific
 variances in output, if we decide to become even more picky about
 this, but I suggest we learn to walk before we try to run.
 - Eventually (?) we modify the layout test scripts themselves to fail if
 the *-bad baselines aren't matched.

 Let me know what you think. If it's a thumbs-up, I'll probably
 implement this next week. Thanks!

 -- Dirk

 





[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Pam Greene
At least in the batch of tests I examined, the ones that needed
re-baselining weren't tests we'd originally failed and suddenly started
passing. They were new tests that nobody had ever taken a good look at.

If that matches everyone else's experience, then all we need is an UNTRIAGED
annotation in the test_expectations file to mark ones the next Great
Re-Baselining needs to examine.
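
Something like the following, purely as an illustration -- the UNTRIAGED
keyword, the bug number, and the test path are all made up here, not syntax
the harness understands today:

  // Nobody has looked closely at this result yet; revisit it during the
  // next rebaselining pass.
  BUG12345 UNTRIAGED : LayoutTests/fast/js/some-new-test.html = FAIL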

I'm not convinced that passing tests we used to fail, or failing tests
differently, happens often enough to warrant the extra work of producing,
storing, and using expected-bad results. Of course, I may be completely
wrong. What did other people see in their batches of tests?

- Pam

On Fri, Aug 21, 2009 at 1:16 PM, Jeremy Orlow jor...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:


 Hi all,

 As Glenn noted, we made great progress last week in rebaselining the
 tests. Unfortunately, we don't have a mechanism to preserve the
 knowledge we gained last week as to whether or not tests need to be
 rebaselined, and why not. As a result, it's easy to imagine
 that we'd need to repeat this process every few months.

 I've written up a proposal for preventing this from happening again,
 and I think it will also help us notice more regressions in the
 future. Check out:


 http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

 Here's the executive summary from that document:

 We have a lot of layout test failures. For each test failure, we have
 no good way of tracking whether or not someone has looked at the test
 output lately, and whether or not the test output is still broken or
 should be rebaselined. We just went through a week of rebaselining,
 and stand a good chance of needing to do that again in a few months
 and losing all of the knowledge that was captured last week.

 So, I propose a way to capture the current broken output from
 failing tests, and to version control them so that we can tell when a
 test's output changes from one expected failing result to another.
 Such a change may reflect that there has been a regression, or that
 the bug has been fixed and the test should be rebaselined.

 Changes

 - We modify the layout test scripts to check for 'foo-bad' as well as
 'foo-expected'. If the output of test foo does not match
 'foo-expected', then we check to see if it matches 'foo-bad'. If it
 does, then we treat it as we treat test failures today, except that
 there is no need to save the failed test result (since a version of
 the output is already checked in). Note that although -bad is
 similar to a different platform, we cannot actually use a different
 platform, since we actually need up to N different -bad versions,
 one for each supported platform that a test fails on.
 - We check in a set of '*-bad' baselines based on current output from
 the regressions. In theory, they should all be legitimate.
 - We modify the test to also report regressions from the *-bad
 baselines. In the cases where we know the failing test is also flaky
 or nondeterministic, we can indicate that as NDFAIL in test
 expectations to distinguish from a regular deterministic FAIL.
 - We modify the rebaselining tools to handle *-bad output as well as
 *-expected.
 - Just like we require each test failure to be associated with a bug, we
 require each *-bad output to be associated with a bug - normally
 (always?) the same bug. The bug should contain comments about what the
 difference is between the broken output and the expected output, and
 why it's different, e.g., something like "Note that the text is in two
 lines in the -bad output, and it should be all on the same line
 without wrapping."
 - The same approach can be used here to justify platform-specific
 variances in output, if we decide to become even more picky about
 this, but I suggest we learn to walk before we try to run.
 - Eventually (?) we modify the layout test scripts themselves to fail if
 the *-bad baselines aren't matched.

 Let me know what you think. If it's a thumbs-up, I'll probably
 implement this next week. Thanks!


 I really like this plan.  It seems easy to implement and quite useful.  +1
 from me!

 





[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Peter Kasting
On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

 I'm not convinced that passing tests we used to fail, or failing tests
 differently, happens often enough to warrant the extra work of producing,
 storing, and using expected-bad results. Of course, I may be completely
 wrong. What did other people see in their batches of tests?


There were a number of tests in my set that were affected by innocuous
upstream changes (the type that would cause me to rebaseline) but were also
affected by some other critical bug that meant I couldn't rebaseline.  I
left comments about these on the relevant bugs and occasionally in the
expectations file.

Generally, when looking at a new test I can tell whether it makes sense to
rebaseline or not without the aid of "when did we fail this before?", since
there are upstream baselines and also obvious correct and incorrect outputs
given the test file.

I agree that the benefit here is low (for me, near zero) and the cost is
not.

PK




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Dirk Pranke

On Fri, Aug 21, 2009 at 4:47 PM, Peter Kasting pkast...@chromium.org wrote:
 On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

 I'm not convinced that passing tests we used to fail, or failing tests
 differently, happens often enough to warrant the extra work of producing,
 storing, and using expected-bad results. Of course, I may be completely
 wrong. What did other people see in their batches of tests?

 There were a number of tests in my set that were affected by innocuous
 upstream changes (the type that would cause me to rebaseline) but were also
 affected by some other critical bug that meant I couldn't rebaseline.  I
 left comments about these on the relevant bugs and occasionally in the
 expectations file.
 Generally, when looking at a new test I can tell whether it makes sense to
 rebaseline or not without the aid of "when did we fail this before?", since
 there are upstream baselines and also obvious correct and incorrect outputs
 given the test file.
 I agree that the benefit here is low (for me, near zero) and the cost is
 not.
 PK

This is all good feedback, thanks! To clarify, though: what do you
think the cost will be? Perhaps you are assuming things about how I
would implement this that are different than what I had in mind.

-- Dirk




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Peter Kasting
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

 This is all good feedback, thanks! To clarify, though: what do you
 think the cost will be? Perhaps you are assuming things about how I
 would implement this that are different than what I had in mind.


Some amount of your time, and some amount of space on the bots.

PK




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Ojan Vafai
On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

 This is all good feedback, thanks! To clarify, though: what do you
 think the cost will be? Perhaps you are assuming things about how I
 would implement this that are different than what I had in mind.


 Some amount of your time, and some amount of space on the bots.


Also, some amount of the rest of the team's time to follow this process.

Ojan




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Dirk Pranke

On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
 On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org
 wrote:

 On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

 This is all good feedback, thanks! To clarify, though: what do you
 think the cost will be? Perhaps you are assuming things about how I
 would implement this that are different than what I had in mind.

 Some amount of your time, and some amount of space on the bots.

 Also, some amount of the rest of the team's time to follow this process.
 Ojan

Okay, it sounds like there's enough initial skepticism that it's
probably worth doing a hack before pushing this fully through. I think
I'll try to take a few snapshots of the layout test failures over a
few days and see if we see any real diffs, and then report back.




[chromium-dev] Re: Handling layout test expectations for failing tests

2009-08-21 Thread Jeremy Orlow
On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

 At least in the batch of tests I examined, the ones that needed
 re-baselining weren't tests we'd originally failed and suddenly started
 passing. They were new tests that nobody had ever taken a good look at.

 If that matches everyone else's experience, then all we need is an
 UNTRIAGED annotation in the test_expectations file to mark ones the next
 Great Re-Baselining needs to examine.


What happens when someone forgets to set this flag?  Didn't we want to avoid
adding any more state to the test_expectations file?

 On Fri, Aug 21, 2009 at 1:47 PM, Evan Martin e...@chromium.org wrote:


 This seems to me like a lot more work for minimal gain.  Because
 you've thought more about it than I have, it makes me think I'm
 misunderstanding something.  Can you explain this more simply, in
 terms of use cases?

 Here's what I think you're saying:
  1) We don't have notes on why tests are failing.  =>  Why not annotate
 the tests in test_lists?  That's what I've always done.


Once again, we don't want to add more state to the test_expectations.  How
many people looked up the tests they were supposed to rebaseline in this file
to see if there were notes?  I kind of doubt anyone.


 2) We don't have a way of tracking when a failing test output changes.
  =>  But failing is failing; no matter what, you want a human to look
 at the result before you mark it as passing, so it doesn't seem like
 it's worth a bunch of extra machinery to track this.  And if a test
 starts passing it gets marked "unexpected pass" by the builders
 already, and it also seems like a human should look at it.


There are different reasons for failing.  A layout test could be failing
because of a known bug and then start failing in a different way (later) due
to a regression.  When a bug fails in a new way, it's worth taking a quick
look, I think.

On Fri, Aug 21, 2009 at 7:52 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
  On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org
  wrote:
 
  On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org
 wrote:
 
  This is all good feedback, thanks! To clarify, though: what do you
  think the cost will be? Perhaps you are assuming things about how I
  would implement this that are different than what I had in mind.
 
  Some amount of your time, and some amount of space on the bots.
 
  Also, some amount of the rest of the team's time to follow this process.
  Ojan

 Okay, it sounds like there's enough initial skepticism that it's
 probably worth doing a hack before pushing this fully through. I think
 I'll try to take a few snapshots of the layout test failures over a
 few days and see if we see any real diffs, and then report back.


All of this said, I agree that there is a cost to maintaining this that I
didn't consider at first.  I think the approach you're taking, Dirk (doing it
locally for a while and seeing if it's useful) is probably the right one.

Of course, the long-term solution is to get the layout test failures to 0.
 :-)

J
