Re: On CI

2021-02-21 Thread John Ericson
I'm not opposed to some effort going into this, but I would strongly 
oppose putting all our effort there. Incremental CI can cut multiple 
hours down to mere minutes, especially with the test suite being 
embarrassingly parallel. There is simply no way that optimizations to 
the compiler, independent of sharing a cache between CI runs, can get 
anywhere close to that return on investment.


(FWIW, I'm also skeptical that the people complaining about GHC 
performance know what's hurting them most. For example, after 
non-incrementality, the next slowest thing is linking, which is...not 
done by GHC! But all that is a separate conversation.)


John

On 2/19/21 2:42 PM, Richard Eisenberg wrote:
There are some good ideas here, but I want to throw out another one: 
put all our effort into reducing compile times. There is a loud plea 
to do this on Discourse, and it would both solve these CI problems 
and also help everyone else.


This isn't to say to stop exploring the ideas here. But since time is 
mostly fixed, tackling compilation times in general may be the best 
way out of this. Ben's survey of other projects (thanks!) shows that 
we're way, way behind in how long our CI takes to run.


Richard

On Feb 19, 2021, at 7:20 AM, Sebastian Graf wrote:


Recompilation avoidance

I think in order to cache more in CI, we first have to invest some 
time in fixing recompilation avoidance in our bootstrapped build system.


I just tested this on a hadrian perf ticky build: adding a one-line 
*comment* to the compiler causes


  * a (pretty slow, yet negligible) rebuild of the stage1 compiler
  * 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS?
It doesn't depend in any way on the change I made)
  * an apparent full rebuild of the libraries
  * an apparent full rebuild of the stage2 compiler

That took 17 minutes, whereas a full build takes ~45 minutes. So there 
definitely is some caching going on, but not nearly as much as there 
could be.
I know there have been great and boring efforts on compiler 
determinism in the past, but either it's not good enough or our build 
system needs fixing.
I think a good first step would be to assert that the hash of the 
stage1 compiler executable doesn't change if I only change a comment.
I'm aware there probably is stuff going on, like embedding configure 
dates in interface files and executables, that would need to go, but 
if possible this would be a huge improvement.
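A minimal sketch of that check, using base's GHC.Fingerprint on copies 
of the binary saved before and after the comment-only rebuild (the 
paths and setup are whatever the build produces; nothing below is 
existing GHC machinery):

    import GHC.Fingerprint (getFileHash)
    import System.Environment (getArgs)
    import System.Exit (exitFailure)

    -- Usage: determinism-check <stage1-ghc-before> <stage1-ghc-after>
    main :: IO ()
    main = do
      [old, new] <- getArgs
      hOld <- getFileHash old
      hNew <- getFileHash new
      if hOld == hNew
        then putStrLn ("stage1 unchanged: " ++ show hOld)
        else do
          putStrLn ("stage1 changed: " ++ show hOld ++ " /= " ++ show hNew)
          exitFailure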


On the other hand, we can simply tack on a [skip ci] to the commit 
message, as I did for 
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975. Variants 
like [skip tests] or [frontend] could help to identify which tests to 
run by default.
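For reference, GitLab picks the marker up from the commit message 
itself; a hypothetical commit might look like:

    Fix a typo in the user's guide

    [skip ci]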


Lean

I had a chat with a colleague about how they do CI for Lean. 
Apparently, CI turnaround time including tests is generally 25 
minutes (~15 minutes for the build) for a complete pipeline, testing 
6 different OSes and configurations in parallel: 
https://github.com/leanprover/lean4/actions/workflows/ci.yml 

They utilise ccache to cache the output of the clang-based C++ 
backend, so that they only have to re-run the front- and middle-end. 
In effect, they take advantage of the fact that the "function" clang, 
in contrast to the "function" stage1 compiler, stays the same.
It's hard to achieve that for GHC, where a complete compiler pipeline 
comes as one big, fused "function": An external tool can never be 
certain that a change to Parser.y could not affect the CodeGen phase.


Inspired by Lean, the following is a bit vague and imaginary, but 
maybe we could make it so that compiler phases "sign" parts of the 
interface file with the binary hash of the respective subcomponents 
of the phase?
E.g., if all the object files that influence CodeGen (that will later 
be linked into the stage1 compiler) result in a hash of 0xdeadbeef 
before and after the change to Parser.y, we know we can stop 
recompiling Data.List with the stage1 compiler when we see that the 
IR passed to CodeGen didn't change, because the last compile did 
CodeGen with a stage1 compiler with the same hash 0xdeadbeef. The 
0xdeadbeef hash is a proxy for saying "the function CodeGen stayed 
the same", so we can reuse its cached outputs.
Of course, that is utopian without a tool that does the "taint 
analysis" of which modules in GHC influence CodeGen. Probably just 
including all the transitive dependencies of GHC.CmmToAsm suffices, 
but probably even that is too crude. For another example, a change 
to GHC.Utils.Unique would probably entail a full rebuild of the 
compiler, because it affects basically all compiler phases.
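That crude approximation is at least cheap to compute. A sketch, 
assuming a module dependency map scraped from, say, ghc -M output 
(the types here are made up):

    import qualified Data.Map as M
    import qualified Data.Set as S

    type Module = String

    -- All modules reachable from a root (e.g. "GHC.CmmToAsm"); hashing
    -- their object files together would yield the phase signature.
    transitiveDeps :: M.Map Module [Module] -> Module -> S.Set Module
    transitiveDeps deps root = go S.empty [root]
      where
        go seen []     = seen
        go seen (m:ms)
          | m `S.member` seen = go seen ms
          | otherwise         = go (S.insert m seen)
                                   (M.findWithDefault [] m deps ++ ms)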
There are probably parallels with recompilation avoidance in a 
language with staged meta-programming.


On Fri, Feb 19, 2021 at 11:42 AM, Josef 

Re: Changes to performance testing?

2021-02-21 Thread Richard Eisenberg


> On Feb 21, 2021, at 11:24 AM, Ben Gamari wrote:
> 
> To mitigate this I would suggest that we allow performance test failures
> in marge-bot pipelines. A slightly weaker variant of this idea would
> instead only allow performance *improvements*. I suspect the latter
> would get most of the benefit, while eliminating the possibility that a
> large regression goes unnoticed.

The value in making performance improvements a test failure is that patch 
authors are informed of what they have done, and can make sure it matches 
expectations. This need can reasonably be satisfied without stopping merging. 
That is, if Marge can accept performance improvements, while (say) posting to 
each MR involved that it may have contributed to a performance improvement, 
then I think we've done our job here.

On the other hand, a performance degradation is a bug, just like, say, an error 
message regression. Even if it's a combination of commits that cause the 
problem (an actual possibility even for error message regressions), it's still 
a bug that we need to either fix or accept (balanced out by other 
improvements). The pain of debugging this scenario might be mitigated if there 
were a collation of the performance wibbles for each individual commit. This 
information is, in general, available: each commit passed CI on its own, and so 
it should be possible to create a little report with its rows being perf tests 
and its columns being commits or MR #s; each cell in the table would be a 
percentage regression. If we're lucky, the regression Marge sees will be the 
sum(*) of the entries in one of the rows -- this means that we have a simple 
agglomeration of performance degradation. If we're less lucky, the whole will 
not equal the sum of the parts, and some of the patches interfere. In either 
case, the table would suggest a likely place to look next.
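Such a table might look like this (all numbers invented for illustration):

                     MR !4001   MR !4002   sum of parts   Marge batch
    T1234 (alloc)      -0.5%      -0.5%       -1.0%         -1.0%
    T5642 (alloc)      +0.3%       0.0%       +0.3%         +0.8%  <- interference

In the first row the whole equals the sum of the parts; in the second it does 
not, pointing at patches that interfere.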

(*) I suppose if we're recording percentages, it wouldn't necessarily be the 
actual sum, because percentages are a bit funny. But you get my meaning.
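(For example, two successive 0.5% improvements compose multiplicatively: 
0.995 * 0.995 = 0.990025, i.e. a 0.9975% total improvement rather than exactly 
1%. Close enough for a diagnostic table, though.)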

Pulling this all together:
* I'm against the initial proposal of allowing all performance failures by 
Marge. This will allow bugs to accumulate (in my opinion).
* I'm in favor of allowing performance improvements to be accepted by Marge.
* To mitigate the information loss of Marge accepting performance 
improvements, it would be great if Marge could alert MR authors that a 
cumulative performance improvement took place.
* To mitigate the annoyance of finding a performance regression in a merge 
commit that does not appear in any component commit, it would be great if 
there were a tool to collect performance numbers from a set of commits and 
present them in a table for further analysis.

These "mitigations" might take work. If labor is impossible to produce to 
complete this work, I'm in favor of simply allowing the performance 
improvements, maybe also filing a ticket about these potential improvements to 
the process.

Richard


Changes to performance testing?

2021-02-21 Thread Ben Gamari
Hi all,

Recently our performance tests have been causing quite some pain. One
reason for this is our new Darwin runners (see #19025), which
(surprisingly) differ significantly in their performance characteristics
(perhaps due to running Big Sur, or to using native tools provided by nix?).

However, this is further exacerbated by the fact that there are quite a
few people working on compiler performance at the moment (hooray!). This
leads to the following failure mode during Marge jobs:

 1. Merge request A improves test T1234 by 0.5%, which is within the
test's acceptance window and therefore CI passes

 2. Merge request B *also* improves test T1234 by another 0.5%, which
similarly passes CI

 3. Marge tries to merge MRs A and B in a batch but finds that the
combined 1% improvement in T1234 is *outside* the acceptance window.
Consequently, the batch fails.
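In code, the failure mode is roughly this (the acceptance window size is
invented for illustration; the real windows are per-test):

    -- Toy model: a perf test passes if the metric moved by no more
    -- than the acceptance window.
    window :: Double
    window = 0.8   -- hypothetical +/-0.8% window

    passes :: Double -> Bool
    passes delta = abs delta <= window

    -- passes (-0.5)          == True   -- MR A alone
    -- passes (-0.5)          == True   -- MR B alone
    -- passes (-0.5 + (-0.5)) == False  -- Marge's batch of A and B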

This is quite painful, especially given that it creates work for those
trying to improve GHC (as the saying goes: no good deed goes
unpunished). 

To mitigate this I would suggest that we allow performance test failures
in marge-bot pipelines. A slightly weaker variant of this idea would
instead only allow performance *improvements*. I suspect the latter
would get most of the benefit, while eliminating the possibility that a
large regression goes unnoticed.

Thoughts?

Cheers,

- Ben

