Re: [blink-dev] Re: Layout tests should not use threaded compositing, part XIII

Stefan Zager Fri, 09 Jun 2023 15:28:18 -0700

On Fri, Jun 9, 2023 at 3:07 PM Elliott Sprehn <[email protected]> wrote:


> Woah 2015 super necro thread!
>
> There were a couple meta angles here:
>
> - Explicitly testing non-web exposed behavior is probably better done with
> a SimTest or other unit test. In the past folks kept adding more methods to
> Internals, or special behavior to the compositor, to make it easier to
> write web_tests. So not having the threaded compositor discourages some
> classes of that. I see folks are still adding new internals methods
> unfortunately, but I still think SimTests (or something else) is a
> better option.
>

Agreed: SimTests are *always* a better option, if you can get the right
coverage. The problem we've had is that SimTests test blink right up to the
blink/compositor boundary; and cc::LayerTreeTest test right up to the same
boundary, from the other side. But we don't have a unit test framework for
the renderer process that tests across that boundary. It's been proposed in
the past, and it's still probably a good idea, to connect these two test
frameworks so we can get better unit test coverage across the boundary. But
that only helps for new tests; we still miss out on the coverage provided
by the huge corpus of existing web_tests.


> - The more "complete" the browser run over web_tests, the more bugs are
> caught, but (at least at the time) browser_tests were very flaky and there
> was a strong desire to avoid that fate within blink. For example you could
> run the full chrome binary for web_tests and catch more bugs too. Why stop
> at just the threaded compositor? The thinking was making web_tests very
> stable was better than catching a small number of real bugs at the cost of
> a flake tax across the hundreds of contributors.
>

Yep, very true: destabilizing the whole system is not worth it. Our first
experiments with enabling threaded compositing by default were pretty
broken, but we discovered a few fixes that greatly reduced the number of
new flakes and failures, and we discovered that a relatively small number
of tests were responsible for just about all of the new flakiness. So we
placed those tests in quarantine: they will continue to run with the
single-threaded compositor, while the large majority of tests (>95%), and
all new tests, will use the threaded compositor. The best time to fix a
flaky test is right when the test is added, and that holds true even if
threaded compositing is the root cause of the flakiness.


> The failure mode I've seen of efforts like this is the SingleThreadedTests
> file grows each time someone finds some flake, so we never really find the
> bugs like 1447868 because folks just keep hiding them each time something
> unusual happens.
>
> All that said, if you can crank up the "realism" of the test runner and
> not introduce flake and convince folks not to hide all the bugs with
> exceptions, that sounds worth it. I hope you all succeed!
>
> - E
>
> On Fri, Jun 9, 2023 at 3:18 PM Stefan Zager <[email protected]> wrote:
>
>> On Fri, Jun 9, 2023 at 11:38 AM Ken Rockot <[email protected]> wrote:
>>
>>>
>>> On Fri, Jun 9, 2023 at 11:25 AM Stefan Zager <[email protected]>
>>> wrote:
>>>
>>>> Resurrecting this thread, because yotha@ are launching the first step
>>>> in a project that completely
>>>> ignores the previous consensus (i.e., that we should *not* enable
>>>> threaded compositing for all
>>>> web_tests).
>>>>
>>>
>>> This is exciting to hear! Is there a good place / bug to follow progress?
>>>
>>
>> Main tracking bug <http://crbug.com/770028>.
>>
>>
>>>
>>>
>>>>
>>>> I will paraphrase the high-level arguments made previously by enne@
>>>> and ojan@, neither of whom
>>>> are still active on chromium:
>>>>
>>>> - Enabling threaded compositing will increase test flakiness
>>>> - Previous efforts to fix the flakiness have been really difficult and
>>>> largely unsuccessful
>>>> - Buggy behavior in the blink/compositor thread boundary is best
>>>> covered by focused unit tests
>>>> - There are vanishingly few bugs in that area; maybe none?
>>>>
>>>> My reponse...
>>>>
>>>> web_tests, by their very nature, are a massive set of integration
>>>> tests. That is not the most efficient
>>>> organization of test coverage. But the beauty of web_tests is that they
>>>> catch a *ton* of bugs and
>>>> prevent many bad patches from landing. The large majority of the bugs
>>>> are in core blink areas,
>>>> but not all of them. web_tests provide effective, if inefficient,
>>>> coverage of almost the entire chromium
>>>> code base. The lack of coverage for threaded compositing is an anomaly
>>>> and a blink spot. "It's too
>>>> hard" is not, in my opinion, a convincing justification.
>>>>
>>>> Yes, turning on threaded compositing *does* make web_tests flakier. And
>>>> that is evidence of real bugs!
>>>> It's oxymoronic to acknowledge the flakiness but claim that there are
>>>> few or no bugs in the
>>>> blink/compositor thread boundary. There are definitely bugs in there!
>>>> They can be excruciatingly hard
>>>> to root-cause, but they definitely exist. A couple of examples...
>>>>
>>>> https://chromium-review.googlesource.com/c/chromium/src/+/3958364
>>>> https://bugs.chromium.org/p/chromium/issues/detail?id=1447868
>>>>
>>>> Once these bugs are root-caused, I certainly agree that we should
>>>> prefer unit tests for regression
>>>> testing. But they key point is that we only discovered these bugs
>>>> because they caused failures in
>>>> web_tests.
>>>>
>>>> Again: these are real bugs!
>>>>
>>>
>>> +1000
>>>
>>>
>>>>
>>>> Because bugs in the thread boundary are usually race conditions they
>>>> can be extremely difficult to
>>>> reproduce, which means that they're very likely under-reported. But
>>>> they still degrade user experience.
>>>> Beyond crbug.com and flaky web_tests, there are also crash reports
>>>> that very likely indicate bugs in this
>>>> area.
>>>>
>>>>
>>>> On Wednesday, November 4, 2015 at 4:58:29 PM UTC-8 Ojan Vafai wrote:
>>>>
>>>>> +1 on all points. I hope this new (currently half-baked) solution will
>>>>> help improve all three of these to varying degrees. It's not panacea, but
>>>>> it should be a lot better.
>>>>>
>>>>>
>>>>> On Wed, Nov 4, 2015 at 11:29 AM Adrienne Walker <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Re: interoperability tests.  Looking at the guidelines for the web
>>>>>> platform tests, I think if all of our layout tests had been written with
>>>>>> the guidelines of short and minimal in mind, we would be in a much better
>>>>>> situation than we are today.
>>>>>>
>>>>>> My biggest concerns about layout tests are: (1) they get used to test
>>>>>> very internal parts of Blink, which then get exposed via awful plumbing 
>>>>>> via
>>>>>> internals or dumping the state of the world into a text file.  (2) they
>>>>>> "over test", especially in terms of pixel and text output, which adds a
>>>>>> rebaselining burden as these things churn over time and over different
>>>>>> platforms, even though that doesn't affect what the test was trying to
>>>>>> verify.  (3) they rely on so much of the stack that they end up being 
>>>>>> flaky.
>>>>>>
>>>>>> I think there's plenty of room for html / javascript tests that are
>>>>>> cross-platform and test web platform features that don't fall into these
>>>>>> traps.
>>>>>>
>>>>>> 2015-11-04 9:32 GMT-08:00 Rick Byers <[email protected]>:
>>>>>>
>>>>>>> I agree with this big picture direction - i.e. unit tests are
>>>>>>> generally better than integration tests.
>>>>>>>
>>>>>>> However I HAVE seen examples in input where we've missed issues
>>>>>>> because we had only unit tests (content, cc and blink layout tests), not
>>>>>>> good integration test that included the effect of threaded input 
>>>>>>> handling.
>>>>>>> One case in particular was pretty embarrassing - IIRC we broke scrolling
>>>>>>> completely in most real-world cases yet not a single test failed :-).
>>>>>>> Still, I think this is, at best, an argument for having a couple sanity
>>>>>>> integration tests (eg. as content browser tests).
>>>>>>>
>>>>>>> My bigger concern here is around interoperability testing.  IMHO the
>>>>>>> only way we're going to substantially reduce developer pain around
>>>>>>> interoperability is by sharing more tests with the other rendering
>>>>>>> engines.  Mozilla is much better at this
>>>>>>> <https://wiki.mozilla.org/Auto-tools/Projects/web-platform-tests>
>>>>>>> than us.  This, by its nature, is impossible to do with unit tests.  So
>>>>>>> when new simple web platform features are being added (especially highly
>>>>>>> deterministic JS APIs), I've recently started encouraging my team to try
>>>>>>> focusing on web-platform-tests <http://testthewebforward.org>.
>>>>>>> Obviously more complex / non-deterministic features should still prefer 
>>>>>>> to
>>>>>>> focus on unit tests.  But the exact balance here seems very non-obvious 
>>>>>>> to
>>>>>>> me.  But like with everything, we should probably just try things, 
>>>>>>> evaluate
>>>>>>> how we're doing, and iterate.  Any concerns?
>>>>>>>
>>>>>>> Rick
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 4, 2015 at 11:39 AM, Ojan Vafai <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thread hijacking this to blink-dev@chromium.
>>>>>>>>
>>>>>>>> At a high level, I want to +1 that we should be moving our testing
>>>>>>>> away from high level regression tests into more deterministic tests.
>>>>>>>> There's a good (not quite finished) alternative to layout tests that I
>>>>>>>> think we should pursue (an evolution of Elliott's C++-based Sim 
>>>>>>>> framework).
>>>>>>>>
>>>>>>>> I'll be adding this to the Blink infra team's long-term roadmap
>>>>>>>> shortly. Coincidentally, Elliott and I hashed out a basic idea 
>>>>>>>> yesterday.
>>>>>>>> At some point we'll turn that into a doc and a proposal, but mostly it
>>>>>>>> needs people to do the engineering work to make it possible. So, if 
>>>>>>>> you're
>>>>>>>> possibly interested in 5%-80%ing on this, ping me.
>>>>>>>>
>>>>>>>> On Tue, Nov 3, 2015 at 4:04 PM 'Adrienne Walker' via paint-dev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> (I sent this previously as an email reply to a few people about a
>>>>>>>>> particular patch, but as this continues to come up on a monthly 
>>>>>>>>> basis, here
>>>>>>>>> is a slightly edited copy of this rant as a reference of my biased 
>>>>>>>>> position
>>>>>>>>> on this topic.)
>>>>>>>>>
>>>>>>>>> We should *not* make layout tests use threaded compositing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We shouldn't do it because it's a rat hole.  jamesr and I both
>>>>>>>>> tried to make this happen on separate occasions.  We spent a lot of 
>>>>>>>>> time
>>>>>>>>> and fought with even more flakiness.  I'm sure somebody could 
>>>>>>>>> heroically
>>>>>>>>> burn a lot of time making this happen and succeed where we couldn't, 
>>>>>>>>> but in
>>>>>>>>> my opinion, it's a waste of time.  Threaded compositing increases
>>>>>>>>> flakiness and loses determinism in terms of when events happen, how
>>>>>>>>> repaints line up with frames, and how animation happens, especially.  
>>>>>>>>> We
>>>>>>>>> already have a ton of problems with flakiness in layout tests, so why 
>>>>>>>>> make
>>>>>>>>> this worse?
>>>>>>>>>
>>>>>>>>> If there are features that are only enabled with threaded
>>>>>>>>> compositing (compositor animations?) that somehow still need
>>>>>>>>> Blink layout tests, we should enable those features for non-
>>>>>>>>> threaded compositing.  I would support removing the virtual/
>>>>>>>>> threaded directory if somebody made it possible to test those
>>>>>>>>> features in other ways.  If there are tests that have different 
>>>>>>>>> behavior in
>>>>>>>>> threaded mode vs not threaded mode, then they need to be deflaked or 
>>>>>>>>> tested
>>>>>>>>> more narrowly via unit tests or esprehn's simulation testing 
>>>>>>>>> framework.
>>>>>>>>>
>>>>>>>>> Single vs threaded compositing is generally well-tested within cc
>>>>>>>>> unit tests.  There should be very little difference at the Blink 
>>>>>>>>> level, and
>>>>>>>>> I would question the need to test it there.  We have ironed out a 
>>>>>>>>> small
>>>>>>>>> number of message ordering differences over time in cc.  We should 
>>>>>>>>> consider
>>>>>>>>> these differences bugs and add cc unit tests for them to guarantee 
>>>>>>>>> the same
>>>>>>>>> ordering of messages.  In the long term, I think the Blimp/Mandoline 
>>>>>>>>> work
>>>>>>>>> has the potential to merge single/thread proxy with an abstracted
>>>>>>>>> glue layer between them.
>>>>>>>>>
>>>>>>>>> So, sure, we should test what we ship, but on the other hand every
>>>>>>>>> test should not be an integration test.  I have have seen exactly 
>>>>>>>>> zero bugs
>>>>>>>>> ever that would have been caught if a layout test had been using 
>>>>>>>>> threaded
>>>>>>>>> compositing instead of single threaded compositing.  Really, zero.
>>>>>>>>>
>>>>>>>>> Personally, I think our problems as a team lie more on the "tests
>>>>>>>>> are flaky and hard to reason about because they're integration tests" 
>>>>>>>>> end
>>>>>>>>> of the spectrum and far less on the "we are shipping huge hidden
>>>>>>>>> regressions because of holes in our testing".
>>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 3, 2015 at 5:17 PM Adrienne Walker <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> 2015-11-03 16:30 GMT-08:00 Alexey <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Can we run all of the virtual/threaded tests in SingleThreadProxy
>>>>>>>>>> <https://code.google.com/p/chromium/codesearch#chromium/src/cc/trees/single_thread_proxy.cc&l=37&gs=cpp:cc::class-SingleThreadProxy::SingleThreadProxy(cc::LayerTreeHost%2520*,%2520cc::LayerTreeHostSingleThreadClient%2520*,%2520scoped_refptr%253Cbase::SingleThreadTaskRunner%253E,%2520scoped_ptr%253Ccc::BeginFrameSource,%2520base::DefaultDeleter%253Ccc::BeginFrameSource%253E%2520%253E)@chromium/../../cc/trees/single_thread_proxy.cc%257Cdef&gsn=SingleThreadProxy&ct=xref_usages>
>>>>>>>>>>  mode?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Currently, no!
>>>>>>>>>
>>>>>>>>> There are some things (e.g. composited animations) that have
>>>>>>>>> behavior differences in code depending on threaded vs single thread.  
>>>>>>>>> It
>>>>>>>>> seems to me that these are wrinkles that could be ironed out.  It 
>>>>>>>>> seems
>>>>>>>>> like there's no reason we couldn't create and run composited 
>>>>>>>>> animations in
>>>>>>>>> single-threaded mode.
>>>>>>>>>
>>>>>>>>> It looks like other things have been added there over time (e.g.
>>>>>>>>> idle callbacks), that I don't understand as well.  Naively, it seems 
>>>>>>>>> like
>>>>>>>>> fundamentally nothing should depend on threaded compositing.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "blink-dev" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/a/chromium.org/d/msgid/blink-dev/4d8dd468-b3c4-4f65-8ac7-1b4ca194e08an%40chromium.org
>>>> <https://groups.google.com/a/chromium.org/d/msgid/blink-dev/4d8dd468-b3c4-4f65-8ac7-1b4ca194e08an%40chromium.org?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CAHOQ7J-0kFrERxC_PZdyfA6yoh_53%2Bu50gm_vuS-hzd1pAXUTg%40mail.gmail.com.

Re: [blink-dev] Re: Layout tests should not use threaded compositing, part XIII

Reply via email to