[chromium-dev] Re: Handling layout test expectations for failing tests
On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:
On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:

1) We don't have notes on why tests are failing. = Why not annotate the tests in test_lists? That's what I've always done.

Once again, we don't want to add more state to the test_expectations. How many people looked up the tests they were supposed to rebaseline in this file to see if there were notes? I kind of doubt anyone.

Um... this makes no sense to me. You can't rebaseline a test without modifying test_expectations. In modifying it, you *have* to look at it. It's pretty difficult to miss comments above tests as you're trying to write REBASELINE or delete the line. If you somehow managed to not see any comments in this file, I think you're an outlier.

I was talking about the rebaselining teams, not the act of actually rebaselining. If someone's rebaselining a test, then it means we now believe it's passing. At that point, the notes are not very interesting, right? Are you saying that you looked at all the tests' notes before you looked through the results to determine if they should be rebaselined?

We're trying to leave all comments in the bugs now, rather than in the test_expectations file, so there's only one point of contact. We used to leave extensive comments in the file, but they always grew stale. And yes, I looked at the bug for every test that I thought was correct, usually to write "tests A, B and C are still bad, but D was actually correct and is being re-baselined."

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

Why? Unless the earlier failure has been fixed we can't rebaseline the test. (I ran into a number of tests like this when doing my rebaselining pass.) What is the point of looking again?

In case the new failure is more serious than the earlier one.

True. But I don't think this will happen often, and I'd rather devote the time to fixing the tests.

- Pam
[chromium-dev] Re: Handling layout test expectations for failing tests
On Mon, Aug 24, 2009 at 10:37 AM, Pam Greene p...@chromium.org wrote:
On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:
On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

Why? Unless the earlier failure has been fixed we can't rebaseline the test. (I ran into a number of tests like this when doing my rebaselining pass.) What is the point of looking again?

In case the new failure is more serious than the earlier one.

True. But I don't think this will happen often, and I'd rather devote the time to fixing the tests.

The end goal is to be in a state where we have near zero failing tests that are not for unimplemented features, and new failures from the merge get addressed within a week. Once we're at that point, would this new infrastructure be useful? I completely support infrastructure that sustainably supports us being at near zero failing tests (e.g. the rebaseline tool). All infrastructure/process has a maintenance cost, though.

Ojan
[chromium-dev] Re: Handling layout test expectations for failing tests
On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:

The end goal is to be in a state where we have near zero failing tests that are not for unimplemented features, and new failures from the merge get addressed within a week. Once we're at that point, would this new infrastructure be useful? I completely support infrastructure that sustainably supports us being at near zero failing tests (e.g. the rebaseline tool). All infrastructure/process has a maintenance cost, though.

True enough. There are at least two counterpoints that are worth considering. The first is that we probably won't be at zero failing tests any time soon (where "any time soon" == the next 3-6 months), and so there may be interim value. The second is that we have a policy of running every test, even tests for unimplemented features, and so we may catch regressions for the foreseeable future. That said, I don't know if the value will offset the cost. Hence the desire to run a couple of cheap experiments :)

-- Dirk
[chromium-dev] Re: Handling layout test expectations for failing tests
On Mon, Aug 24, 2009 at 1:52 PM, David Levin le...@google.com wrote:
On Mon, Aug 24, 2009 at 1:37 PM, Dirk Pranke dpra...@chromium.org wrote:
On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:

The end goal is to be in a state where we have near zero failing tests that are not for unimplemented features, and new failures from the merge get addressed within a week. Once we're at that point, would this new infrastructure be useful? I completely support infrastructure that sustainably supports us being at near zero failing tests (e.g. the rebaseline tool). All infrastructure/process has a maintenance cost, though.

True enough. There are at least two counterpoints that are worth considering. The first is that we probably won't be at zero failing tests any time soon (where "any time soon" == the next 3-6 months), and so there may be interim value. The second is that we have a policy of running every test, even tests for unimplemented features, and so we may catch regressions for the foreseeable future. That said, I don't know if the value will offset the cost. Hence the desire to run a couple of cheap experiments :)

What do the cheap experiments entail? Key concern: if the cheapness is to put more work on the WebKit gardeners, it isn't cheap at all, imo.

Cheap experiments == me snapshotting the results of tests I run periodically and comparing them. No work for anyone else.

-- Dirk
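The snapshot-and-diff experiment Dirk describes could be as small as the sketch below: archive the failing tests' actual output after each run and diff the newest archive against the previous one. This is purely an illustrative sketch, not checked-in tooling; the directory names and the reliance on the '-actual.txt' suffix are assumptions.

```python
import datetime
import filecmp
import os
import shutil

RESULTS_DIR = 'layout-test-results'   # wherever the harness wrote this run's output (assumed)
SNAPSHOT_ROOT = 'failure-snapshots'   # local archive of earlier runs (assumed)

def take_snapshot():
    """Copy this run's results into a dated snapshot directory."""
    dest = os.path.join(SNAPSHOT_ROOT, datetime.date.today().isoformat())
    shutil.copytree(RESULTS_DIR, dest)
    return dest

def changed_failures(old_dir, new_dir):
    """List tests whose failing text output differs between two snapshots."""
    changed = []
    for root, _, files in os.walk(new_dir):
        for name in files:
            if not name.endswith('-actual.txt'):
                continue
            rel = os.path.relpath(os.path.join(root, name), new_dir)
            old_path = os.path.join(old_dir, rel)
            # Only report tests that failed in both runs but failed differently.
            if os.path.exists(old_path) and not filecmp.cmp(
                    old_path, os.path.join(new_dir, rel), shallow=False):
                changed.append(rel)
    return changed

if __name__ == '__main__':
    earlier = sorted(os.listdir(SNAPSHOT_ROOT)) if os.path.isdir(SNAPSHOT_ROOT) else []
    today = take_snapshot()
    if earlier:
        for test in changed_failures(os.path.join(SNAPSHOT_ROOT, earlier[-1]), today):
            print('failure changed: ' + test)
```

Anything the diff reports is either a new regression or a candidate for rebaselining, which is the same signal the proposed '-bad' baselines would capture permanently.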
[chromium-dev] Re: Handling layout test expectations for failing tests
On Mon, Aug 24, 2009 at 1:37 PM, Dirk Pranke dpra...@chromium.org wrote:
On Mon, Aug 24, 2009 at 11:37 AM, Ojan Vafai o...@chromium.org wrote:

The end goal is to be in a state where we have near zero failing tests that are not for unimplemented features, and new failures from the merge get addressed within a week. Once we're at that point, would this new infrastructure be useful? I completely support infrastructure that sustainably supports us being at near zero failing tests (e.g. the rebaseline tool). All infrastructure/process has a maintenance cost, though.

True enough. There are at least two counterpoints that are worth considering. The first is that we probably won't be at zero failing tests any time soon (where "any time soon" == the next 3-6 months), and so there may be interim value. The second is that we have a policy of running every test, even tests for unimplemented features, and so we may catch regressions for the foreseeable future. That said, I don't know if the value will offset the cost. Hence the desire to run a couple of cheap experiments :)

What do the cheap experiments entail? Key concern: if the cheapness is to put more work on the WebKit gardeners, it isn't cheap at all, imo.

-- Dirk
[chromium-dev] Re: Handling layout test expectations for failing tests
On Sat, Aug 22, 2009 at 9:51 PM, Jeremy Orlow jor...@chromium.org wrote:

It might be worth going through all the LayoutTest bugs and double-checking that they're split up into individual root causes (or something approximating that). I'll try to make time to do a scan in the next week or so, but it'd be great if anyone else had time to help. :-)

I've been doing this last week. Maybe we could figure out how to do this in parallel on Monday?

:DG
[chromium-dev] Re: Handling layout test expectations for failing tests
I understand the resistance to implementing yet another bit of process and effort around layout tests. I really do. However, I found some merit in Dirk's idea -- it allows us to clearly see the impact of a regression. Sadly, I can't come up with a specific example at the moment, but let me pull one out of my ... hat, based on previous experiences.

Let's say we had a regression in JSON parsing. But since we already fail parts of LayoutTests/fast/js/JSON-parse.html, we wouldn't notice it. Especially with DOM bindings, there are tons of tests like this -- we pass only parts of them, so we wouldn't know when our changes or commits upstream introduce regressions that we really ought to be noticing. It's kind of like marking layout tests as flaky: there's no way to determine whether the flakiness is gone other than by recording some extra information.

So to me at least, the benefit of this type of solution is not near-zero. My only hesitation comes from having to decide whether we should stop and implement this rather than dedicate all of our resources to plowing ahead in fixing layout tests and driving the number to 0 (and thus eliminating the need for this solution).

:DG

On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

Some amount of your time, and some amount of space on the bots.

Also, some amount of the rest of the team's time to follow this process.

Ojan
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:

1) We don't have notes on why tests are failing. = Why not annotate the tests in test_lists? That's what I've always done.

Once again, we don't want to add more state to the test_expectations. How many people looked up the tests they were supposed to rebaseline in this file to see if there were notes? I kind of doubt anyone.

Um... this makes no sense to me. You can't rebaseline a test without modifying test_expectations. In modifying it, you *have* to look at it. It's pretty difficult to miss comments above tests as you're trying to write REBASELINE or delete the line. If you somehow managed to not see any comments in this file, I think you're an outlier.

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

Why? Unless the earlier failure has been fixed we can't rebaseline the test. (I ran into a number of tests like this when doing my rebaselining pass.) What is the point of looking again?

PK
[chromium-dev] Re: Handling layout test expectations for failing tests
On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 8:07 PM, Jeremy Orlow jor...@chromium.org wrote:

1) We don't have notes on why tests are failing. = Why not annotate the tests in test_lists? That's what I've always done.

Once again, we don't want to add more state to the test_expectations. How many people looked up the tests they were supposed to rebaseline in this file to see if there were notes? I kind of doubt anyone.

Um... this makes no sense to me. You can't rebaseline a test without modifying test_expectations. In modifying it, you *have* to look at it. It's pretty difficult to miss comments above tests as you're trying to write REBASELINE or delete the line. If you somehow managed to not see any comments in this file, I think you're an outlier.

I was talking about the rebaselining teams, not the act of actually rebaselining. If someone's rebaselining a test, then it means we now believe it's passing. At that point, the notes are not very interesting, right? Are you saying that you looked at all the tests' notes before you looked through the results to determine if they should be rebaselined?

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

Why? Unless the earlier failure has been fixed we can't rebaseline the test. (I ran into a number of tests like this when doing my rebaselining pass.) What is the point of looking again?

In case the new failure is more serious than the earlier one.
[chromium-dev] Re: Handling layout test expectations for failing tests
On Sat, Aug 22, 2009 at 5:54 PM, Peter Kasting pkast...@chromium.org wrote:
On Sat, Aug 22, 2009 at 4:29 PM, Jeremy Orlow jor...@chromium.org wrote:
On Sat, Aug 22, 2009 at 4:00 PM, Peter Kasting pkast...@chromium.org wrote:

If you somehow managed to not see any comments in this file, I think you're an outlier.

I was talking about the rebaselining teams, not the act of actually rebaselining. If someone's rebaselining a test, then it means we now believe it's passing. At that point, the notes are not very interesting, right? Are you saying that you looked at all the tests' notes before you looked through the results to determine if they should be rebaselined?

I certainly looked at them during the process of determining what was going on, and left several notes of my own. I don't think I understand your objection. Are you saying notes are useless or that they're harmful? I don't think either is true. If you're trying to determine how to fix a layout test, the notes in the file are one of the first things you see, because you're looking in the file to find the bug #, what OSes are affected, etc. At that point notes that say what to look for are useful. If you're trying to determine whether to rebaseline a test, notes are at worst harmless and at best useful in pointing out some subtlety that you overlooked if you'd already made your decision. You HAVE to see the notes because you HAVE to edit the file. Notes in test_expectations.txt are like comments in source code: a great boon.

I've heard differing opinions, but you're definitely the most gung-ho I've talked to about notes in the test_expectations.txt file. Typically bugs are where most if not all of the information on failures should be kept. If there is information in the test_expectations.txt file, it should certainly be a subset of the information in the bugs, would you not agree?

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

Why? Unless the earlier failure has been fixed we can't rebaseline the test. (I ran into a number of tests like this when doing my rebaselining pass.) What is the point of looking again?

In case the new failure is more serious than the earlier one.

The only possible reason I could think of that would matter is if we're using this as a source of triage input into which bugs we should fix first. But we have so many thousands of bugs, nearly all likely to be higher priority than a second failure in a test we already haven't prioritized fixing, that I don't consider this a valuable signal.

I suppose that is true.
[chromium-dev] Re: Handling layout test expectations for failing tests
On Sat, Aug 22, 2009 at 7:49 PM, Jeremy Orlow jor...@chromium.org wrote:
On Sat, Aug 22, 2009 at 5:54 PM, Peter Kasting pkast...@chromium.org wrote:

Notes in test_expectations.txt are like comments in source code: a great boon.

I've heard differing opinions, but you're definitely the most gung-ho I've talked to about notes in the test_expectations.txt file. Typically bugs are where most if not all of the information on failures should be kept. If there is information in the test_expectations.txt file, it should certainly be a subset of the information in the bugs, would you not agree?

Yes, that is ideal. One nice thing about comments in the test_expectations file is that unlike comments in bugs, they're (a) hard to miss and (b) unlikely to be drowned by a sea of bugdroid comments and other spew. Also, frequently tests with completely different failures get grouped into one bug ("merge failures r1-r2") and comments on the tests can help add clarity (although splitting these into multiple bugs is also advisable).

PK
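For readers who haven't edited the file, here is roughly what an annotated test_expectations.txt entry of that era looked like. The bug numbers, platforms, and test paths below are invented, and the exact syntax is approximated from memory, so treat it as illustrative only:

```
// Fails since the last merge; the text output wraps onto two lines.
// See the bug for the diff against the upstream baseline.
BUG10001 WIN LINUX : LayoutTests/fast/js/JSON-parse.html = FAIL

// Output now matches upstream; queued for the rebaseline tool.
BUG10002 REBASELINE MAC : LayoutTests/fast/forms/example.html = FAIL
```

The comment lines are the "notes above tests" being debated in this exchange; the bug association and the REBASELINE marker are the pieces both sides agree belong in the file.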
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:

Hi all,

As Glenn noted, we made great progress last week in rebaselining the tests. Unfortunately, we don't have a mechanism to preserve the knowledge we gained last week as to whether or not tests need to be rebaselined, and why not. As a result, it's easy to imagine that we'd need to repeat this process every few months. I've written up a proposal for preventing this from happening again, and I think it will also help us notice more regressions in the future. Check out: http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

Here's the executive summary from that document:

We have a lot of layout test failures. For each test failure, we have no good way of tracking whether or not someone has looked at the test output lately, and whether or not the test output is still broken or should be rebaselined. We just went through a week of rebaselining, and stand a good chance of needing to do that again in a few months and losing all of the knowledge that was captured last week. So, I propose a way to capture the current broken output from failing tests, and to version-control it so that we can tell when a test's output changes from one expected failing result to another. Such a change may reflect that there has been a regression, or that the bug has been fixed and the test should be rebaselined.

Changes:

- We modify the layout test scripts to check for 'foo-bad' as well as 'foo-expected'. If the output of test foo does not match 'foo-expected', then we check to see if it matches 'foo-bad'. If it does, then we treat it as we treat test failures today, except that there is no need to save the failed test result (since a version of the output is already checked in). Note that although '-bad' is similar to a different platform, we cannot actually use a different platform, since we actually need up to N different '-bad' versions, one for each supported platform that a test fails on.
- We check in a set of '*-bad' baselines based on current output from the regressions. In theory, they should all be legitimate.
- We modify the test scripts to also report regressions from the '*-bad' baselines.
- In the cases where we know the failing test is also flaky or nondeterministic, we can indicate that as NDFAIL in test expectations to distinguish it from a regular deterministic FAIL.
- We modify the rebaselining tools to handle '*-bad' output as well as '*-expected'.
- Just like we require each test failure to be associated with a bug, we require each '*-bad' output to be associated with a bug - normally (always?) the same bug. The bug should contain comments about what the difference is between the broken output and the expected output, and why it's different, e.g., something like "Note that the text is in two lines in the -bad output, and it should be all on the same line without wrapping." The same approach can be used here to justify platform-specific variances in output, if we decide to become even more picky about this, but I suggest we learn to walk before we try to run.
- Eventually (?) we modify the layout test scripts themselves to fail if the '*-bad' baselines aren't matched.

Let me know what you think. If it's a thumbs up, I'll probably implement this next week. Thanks!

I really like this plan. It seems easy to implement and quite useful. +1 from me!
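For concreteness, here is a minimal sketch (in Python, like the layout test harness itself) of the check order the proposal describes for a single text test. The function names, the '-bad.txt' naming, and the return values are assumptions for illustration, not the actual run_webkit_tests implementation:

```python
import os

def _read(path):
    """Return the file's contents, or None if it doesn't exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()

def classify_text_result(test_path, actual_text):
    """Hypothetical sketch of the proposed check order for one text test.

    Compare against the known-good baseline (foo-expected.txt) first; if that
    doesn't match, compare against the checked-in known-bad output
    (foo-bad.txt). Anything that matches neither deserves a human look: it is
    either a new regression or a fix that calls for rebaselining.
    """
    base = os.path.splitext(test_path)[0]

    expected = _read(base + '-expected.txt')
    if expected is not None and actual_text == expected:
        return 'PASS'

    known_bad = _read(base + '-bad.txt')
    if known_bad is not None and actual_text == known_bad:
        # Failing, but failing in exactly the way that was already triaged;
        # no need to save the output, since a copy is already checked in.
        return 'KNOWN_FAIL'

    # Matches neither baseline: flag it for triage.
    return 'UNEXPECTED_CHANGE'
```

Per-platform '-bad' baselines would presumably live alongside the corresponding platform-specific '-expected' files, which is the "up to N different -bad versions" point above.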
[chromium-dev] Re: Handling layout test expectations for failing tests
This seems to me like a lot more work for minimal gain. Because you've thought more about it than I have, it makes me think I'm misunderstanding something. Can you explain this more simply, in terms of use cases? Here's what I think you're saying:

1) We don't have notes on why tests are failing. = Why not annotate the tests in test_lists? That's what I've always done.

2) We don't have a way of tracking when a failing test's output changes. = But failing is failing; no matter what, you want a human to look at the result before you mark it as passing, so it doesn't seem like it's worth a bunch of extra machinery to track this. And if a test starts passing it gets marked "unexpected pass" by the builders already, and it also seems like a human should look at it.

On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:

Hi all,

As Glenn noted, we made great progress last week in rebaselining the tests. Unfortunately, we don't have a mechanism to preserve the knowledge we gained last week as to whether or not tests need to be rebaselined, and why not. As a result, it's easy to imagine that we'd need to repeat this process every few months. I've written up a proposal for preventing this from happening again, and I think it will also help us notice more regressions in the future. Check out: http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

Here's the executive summary from that document:

We have a lot of layout test failures. For each test failure, we have no good way of tracking whether or not someone has looked at the test output lately, and whether or not the test output is still broken or should be rebaselined. We just went through a week of rebaselining, and stand a good chance of needing to do that again in a few months and losing all of the knowledge that was captured last week. So, I propose a way to capture the current broken output from failing tests, and to version-control it so that we can tell when a test's output changes from one expected failing result to another. Such a change may reflect that there has been a regression, or that the bug has been fixed and the test should be rebaselined.

Changes:

- We modify the layout test scripts to check for 'foo-bad' as well as 'foo-expected'. If the output of test foo does not match 'foo-expected', then we check to see if it matches 'foo-bad'. If it does, then we treat it as we treat test failures today, except that there is no need to save the failed test result (since a version of the output is already checked in). Note that although '-bad' is similar to a different platform, we cannot actually use a different platform, since we actually need up to N different '-bad' versions, one for each supported platform that a test fails on.
- We check in a set of '*-bad' baselines based on current output from the regressions. In theory, they should all be legitimate.
- We modify the test scripts to also report regressions from the '*-bad' baselines.
- In the cases where we know the failing test is also flaky or nondeterministic, we can indicate that as NDFAIL in test expectations to distinguish it from a regular deterministic FAIL.
- We modify the rebaselining tools to handle '*-bad' output as well as '*-expected'.
- Just like we require each test failure to be associated with a bug, we require each '*-bad' output to be associated with a bug - normally (always?) the same bug. The bug should contain comments about what the difference is between the broken output and the expected output, and why it's different, e.g., something like "Note that the text is in two lines in the -bad output, and it should be all on the same line without wrapping." The same approach can be used here to justify platform-specific variances in output, if we decide to become even more picky about this, but I suggest we learn to walk before we try to run.
- Eventually (?) we modify the layout test scripts themselves to fail if the '*-bad' baselines aren't matched.

Let me know what you think. If it's a thumbs up, I'll probably implement this next week. Thanks!

-- Dirk
[chromium-dev] Re: Handling layout test expectations for failing tests
At least in the batch of tests I examined, the ones that needed re-baselining weren't tests we'd originally failed and suddenly started passing. They were new tests that nobody had ever taken a good look at. If that matches everyone else's experience, then all we need is an UNTRIAGED annotation in the test_expectations file to mark ones the next Great Re-Baselining needs to examine.

I'm not convinced that passing tests we used to fail, or failing tests differently, happens often enough to warrant the extra work of producing, storing, and using expected-bad results. Of course, I may be completely wrong. What did other people see in their batches of tests?

- Pam

On Fri, Aug 21, 2009 at 1:16 PM, Jeremy Orlow jor...@chromium.org wrote:
On Fri, Aug 21, 2009 at 1:00 PM, Dirk Pranke dpra...@chromium.org wrote:

Hi all,

As Glenn noted, we made great progress last week in rebaselining the tests. Unfortunately, we don't have a mechanism to preserve the knowledge we gained last week as to whether or not tests need to be rebaselined, and why not. As a result, it's easy to imagine that we'd need to repeat this process every few months. I've written up a proposal for preventing this from happening again, and I think it will also help us notice more regressions in the future. Check out: http://dev.chromium.org/developers/design-documents/handlinglayouttestexpectations

Here's the executive summary from that document:

We have a lot of layout test failures. For each test failure, we have no good way of tracking whether or not someone has looked at the test output lately, and whether or not the test output is still broken or should be rebaselined. We just went through a week of rebaselining, and stand a good chance of needing to do that again in a few months and losing all of the knowledge that was captured last week. So, I propose a way to capture the current broken output from failing tests, and to version-control it so that we can tell when a test's output changes from one expected failing result to another. Such a change may reflect that there has been a regression, or that the bug has been fixed and the test should be rebaselined.

Changes:

- We modify the layout test scripts to check for 'foo-bad' as well as 'foo-expected'. If the output of test foo does not match 'foo-expected', then we check to see if it matches 'foo-bad'. If it does, then we treat it as we treat test failures today, except that there is no need to save the failed test result (since a version of the output is already checked in). Note that although '-bad' is similar to a different platform, we cannot actually use a different platform, since we actually need up to N different '-bad' versions, one for each supported platform that a test fails on.
- We check in a set of '*-bad' baselines based on current output from the regressions. In theory, they should all be legitimate.
- We modify the test scripts to also report regressions from the '*-bad' baselines.
- In the cases where we know the failing test is also flaky or nondeterministic, we can indicate that as NDFAIL in test expectations to distinguish it from a regular deterministic FAIL.
- We modify the rebaselining tools to handle '*-bad' output as well as '*-expected'.
- Just like we require each test failure to be associated with a bug, we require each '*-bad' output to be associated with a bug - normally (always?) the same bug. The bug should contain comments about what the difference is between the broken output and the expected output, and why it's different, e.g., something like "Note that the text is in two lines in the -bad output, and it should be all on the same line without wrapping." The same approach can be used here to justify platform-specific variances in output, if we decide to become even more picky about this, but I suggest we learn to walk before we try to run.
- Eventually (?) we modify the layout test scripts themselves to fail if the '*-bad' baselines aren't matched.

Let me know what you think. If it's a thumbs up, I'll probably implement this next week. Thanks!

I really like this plan. It seems easy to implement and quite useful. +1 from me!
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

I'm not convinced that passing tests we used to fail, or failing tests differently, happens often enough to warrant the extra work of producing, storing, and using expected-bad results. Of course, I may be completely wrong. What did other people see in their batches of tests?

There were a number of tests in my set that were affected by innocuous upstream changes (the type that would cause me to rebaseline) but were also affected by some other critical bug that meant I couldn't rebaseline. I left comments about these on the relevant bugs and occasionally in the expectations file. Generally, when looking at a new test I can tell whether it makes sense to rebaseline or not without the aid of "when did we fail this before?", since there are upstream baselines and also obvious correct and incorrect outputs given the test file. I agree that the benefit here is low (for me, near zero) and the cost is not.

PK
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 4:47 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

I'm not convinced that passing tests we used to fail, or failing tests differently, happens often enough to warrant the extra work of producing, storing, and using expected-bad results. Of course, I may be completely wrong. What did other people see in their batches of tests?

There were a number of tests in my set that were affected by innocuous upstream changes (the type that would cause me to rebaseline) but were also affected by some other critical bug that meant I couldn't rebaseline. I left comments about these on the relevant bugs and occasionally in the expectations file. Generally, when looking at a new test I can tell whether it makes sense to rebaseline or not without the aid of "when did we fail this before?", since there are upstream baselines and also obvious correct and incorrect outputs given the test file. I agree that the benefit here is low (for me, near zero) and the cost is not.

PK

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

-- Dirk
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

Some amount of your time, and some amount of space on the bots.

PK
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

Some amount of your time, and some amount of space on the bots.

Also, some amount of the rest of the team's time to follow this process.

Ojan
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

Some amount of your time, and some amount of space on the bots.

Also, some amount of the rest of the team's time to follow this process.

Ojan

Okay, it sounds like there's enough initial skepticism that it's probably worth doing a hack before pushing this fully through. I think I'll try to take a few snapshots of the layout test failures over a few days and see if we see any real diffs, and then report back.
[chromium-dev] Re: Handling layout test expectations for failing tests
On Fri, Aug 21, 2009 at 2:33 PM, Pam Greene p...@chromium.org wrote:

At least in the batch of tests I examined, the ones that needed re-baselining weren't tests we'd originally failed and suddenly started passing. They were new tests that nobody had ever taken a good look at. If that matches everyone else's experience, then all we need is an UNTRIAGED annotation in the test_expectations file to mark ones the next Great Re-Baselining needs to examine.

What happens when someone forgets to set this flag? Didn't we want to avoid adding any more state to the test_expectations file?

On Fri, Aug 21, 2009 at 1:47 PM, Evan Martin e...@chromium.org wrote:

This seems to me like a lot more work for minimal gain. Because you've thought more about it than I have, it makes me think I'm misunderstanding something. Can you explain this more simply, in terms of use cases? Here's what I think you're saying:

1) We don't have notes on why tests are failing. = Why not annotate the tests in test_lists? That's what I've always done.

Once again, we don't want to add more state to the test_expectations. How many people looked up the tests they were supposed to rebaseline in this file to see if there were notes? I kind of doubt anyone.

2) We don't have a way of tracking when a failing test's output changes. = But failing is failing; no matter what, you want a human to look at the result before you mark it as passing, so it doesn't seem like it's worth a bunch of extra machinery to track this. And if a test starts passing it gets marked "unexpected pass" by the builders already, and it also seems like a human should look at it.

There are different reasons for failing. A layout test could be failing because of a known bug and then start failing in a different way (later) due to a regression. When a test fails in a new way, it's worth taking a quick look, I think.

On Fri, Aug 21, 2009 at 7:52 PM, Dirk Pranke dpra...@chromium.org wrote:
On Fri, Aug 21, 2009 at 6:43 PM, Ojan Vafai o...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:54 PM, Peter Kasting pkast...@chromium.org wrote:
On Fri, Aug 21, 2009 at 4:50 PM, Dirk Pranke dpra...@chromium.org wrote:

This is all good feedback, thanks! To clarify, though: what do you think the cost will be? Perhaps you are assuming things about how I would implement this that are different than what I had in mind.

Some amount of your time, and some amount of space on the bots.

Also, some amount of the rest of the team's time to follow this process.

Ojan

Okay, it sounds like there's enough initial skepticism that it's probably worth doing a hack before pushing this fully through. I think I'll try to take a few snapshots of the layout test failures over a few days and see if we see any real diffs, and then report back.

All of this said, I agree that there is a cost to maintaining this that I didn't consider at first. I think the approach you're taking, Dirk (doing it locally for a while and seeing if it's useful), is probably the right one. Of course the long-term solution is to get the layout test failures to 0. :-)

J