On Feb 19, 2021, at 7:20 AM, Sebastian Graf <sgraf1...@gmail.com> wrote:
Recompilation avoidance
I think in order to cache more in CI, we first have to invest some
time in fixing recompilation avoidance in our bootstrapped build system.
I just tested on a hadrian perf ticky build: Adding one line of
*comment* in the compiler causes
* a (pretty slow, yet negligible) rebuild of the stage1 compiler
* 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS?
It doesn't depend in any way on the change I made)
* an apparent full rebuild of the libraries
* an apparent full rebuild of the stage2 compiler
That took 17 minutes; a full build takes ~45 minutes. So there
definitely is some caching going on, but not nearly as much as
there could be.
I know there have been great (and tedious) efforts on compiler
determinism in the past, but either they are not good enough or
our build system needs fixing.
I think a good first property to assert would be that the hash of
the stage1 compiler executable doesn't change if I only change a
comment.
I'm aware there probably is stuff going on, like embedding configure
dates in interface files and executables, that would need to go, but
if possible this would be a huge improvement.
On the other hand, we can simply tack on a [skip ci] to the commit
message, as I did for
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975. Variants
like [skip tests] or [frontend] could help to identify which tests to
run by default.
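Sketched in Haskell, a CI script could dispatch on such tags
roughly as follows (the tag names beyond [skip ci] are
hypothetical):

    import Data.List (isInfixOf)

    data CiPlan = SkipAll | SkipTests | FrontendOnly | FullPipeline
      deriving (Show, Eq)

    -- Pick a pipeline based on markers in the commit message.
    planFor :: String -> CiPlan
    planFor msg
      | "[skip ci]"    `isInfixOf` msg = SkipAll
      | "[skip tests]" `isInfixOf` msg = SkipTests
      | "[frontend]"   `isInfixOf` msg = FrontendOnly
      | otherwise                      = FullPipeline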
Lean
I had a chat with a colleague about how they do CI for Lean.
Apparently, CI turnaround time including tests is generally 25
minutes (~15 minutes for the build) for a complete pipeline, testing
6 different OSes and configurations in parallel:
https://github.com/leanprover/lean4/actions/workflows/ci.yml
They utilise ccache to cache the clang-based C++ backend, so that
they only have to re-run the front- and middle-end. In effect, they
take advantage of the fact that the "function" clang, in contrast to
the "function" stage1 compiler, stays the same.
It's hard to achieve that for GHC, where a complete compiler pipeline
comes as one big, fused "function": An external tool can never be
certain that a change to Parser.y could not affect the CodeGen phase.
Inspired by Lean, the following is a bit vague and speculative,
but maybe we could make it so that compiler phases "sign" parts of
the interface file with the binary hash of the respective
subcomponents of the phase?
E.g., if all the object files that influence CodeGen (and that
will later be linked into the stage1 compiler) hash to 0xdeadbeef
both before and after the change to Parser.y, then we know we can
stop recompiling Data.List with the stage1 compiler when we see
that the IR passed to CodeGen didn't change, because the last
compile did CodeGen with a stage1 compiler of the same hash
0xdeadbeef. The 0xdeadbeef hash is a proxy for saying "the
function CodeGen stayed the same", so we can reuse its cached
outputs.
Of course, that is utopian without a tool that does the "taint
analysis" of which modules in GHC influence CodeGen. Probably just
including all the transitive dependencies of GHC.CmmToAsm
suffices, but even that is probably too crude. For another
example, a change to GHC.Utils.Unique would probably entail a full
rebuild of the compiler, because it affects basically all compiler
phases.
There are probably parallels with recompilation avoidance in a
language with staged meta-programming.
On Fri, Feb 19, 2021 at 11:42 AM, Josef Svenningsson via ghc-devs
<ghc-devs@haskell.org> wrote:
Doing "optimistic caching" like you suggest sounds very
promising. A way to regain more robustness would be as follows.
If the build fails while building the libraries or the stage2
compiler, this might be a false negative due to the optimistic
caching. Therefore, evict the "optimistic caches" and restart
building the libraries. That way we can validate that the build
failure was a true build failure and not just due to the
aggressive caching scheme.
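In pseudocode, with hypothetical hooks into the build system, the
recovery logic would be:

    data BuildResult = Success | Failure

    -- cachedBuild, evict and coldBuild are hypothetical hooks
    -- into the build system.
    validateOptimisticBuild
      :: IO BuildResult  -- build using the optimistic caches
      -> IO ()           -- evict the optimistic caches
      -> IO BuildResult  -- rebuild from scratch
      -> IO BuildResult
    validateOptimisticBuild cachedBuild evict coldBuild = do
      r <- cachedBuild
      case r of
        Success -> pure Success
        Failure -> do
          evict      -- rule out false negatives from stale caches
          coldBuild  -- a failure here is a true build failure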
Just my 2p
Josef
------------------------------------------------------------------------
*From:* ghc-devs <ghc-devs-boun...@haskell.org> on behalf of Simon
Peyton Jones via ghc-devs <ghc-devs@haskell.org>
*Sent:* Friday, February 19, 2021 8:57 AM
*To:* John Ericson <john.ericson@obsidian.systems>; ghc-devs
<ghc-devs@haskell.org>
*Subject:* RE: On CI
1. Building and testing happen together. When tests fail
spuriously, we also have to rebuild GHC in addition to
re-running the tests. That's pure waste.
https://gitlab.haskell.org/ghc/ghc/-/issues/13897
tracks this more or less.
I don’t get this. We have to build GHC before we can test it,
don’t we?
2. We don't cache between jobs.
This is, I think, the big one. We endlessly build the exact
same binaries.
There is a problem, though. If we make **any** change in GHC,
even a trivial refactoring, its binary will change slightly. So
now any caching build system will assume that anything built by
that GHC must be rebuilt – we can’t use the cached version. That
includes all the libraries and the stage2 compiler. So caching
can save all the preliminaries (building the initial Cabal, and a
large chunk of stage1, since they are built with the same
bootstrap compiler) but after that we are dead.
I don’t know any robust way out of this. That small change in
the source code of GHC might be a trivial refactoring, or it might
introduce a critical mis-compilation which we really want to see
in its build products.
However, for smoke-testing MRs, on every architecture, we could
perhaps cut corners. (Leaving Marge to do full diligence.) For
example, we could declare that if we have the result of compiling
library module X.hs with the stage1 GHC in the last full commit
in master, then we can re-use that build product rather than
compiling X.hs with the MR’s slightly modified stage1 GHC. That
**might** be wrong; but it’s usually right.
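As a sketch of that corner-cutting rule in Haskell (all names are
stand-ins, not Hadrian API): smoke-test builds key the cache on
the module's source hash alone, while full validation keys on the
compiler hash as well.

    type SrcHash      = String
    type CompilerHash = String

    data Mode = SmokeTest | FullValidation

    -- The key under which the build product of X.hs is looked up.
    cacheKey :: Mode -> CompilerHash -> SrcHash
             -> (Maybe CompilerHash, SrcHash)
    cacheKey SmokeTest      _   src = (Nothing,  src)  -- usually right
    cacheKey FullValidation ghc src = (Just ghc, src)  -- always sound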
Anyway, there are big wins to be had here.
Simon
*From:* ghc-devs <ghc-devs-boun...@haskell.org> *On Behalf Of*
John Ericson
*Sent:* 19 February 2021 03:19
*To:* ghc-devs <ghc-devs@haskell.org>
*Subject:* Re: On CI
I am also wary of us deferring checking whole platforms and
whatnot. I think that's just kicking the can down the road, and
will result in more variance and uncertainty. It might be alright
for those authoring PRs, but it will make Ben's job keeping the
system running even more grueling.
Before getting into these complex trade-offs, I think we should
focus on the cornerstone issue that CI isn't incremental.
1. Building and testing happen together. When tests fail
spuriously, we also have to rebuild GHC in addition to
re-running the tests. That's pure waste.
https://gitlab.haskell.org/ghc/ghc/-/issues/13897
tracks this more or less.
2. We don't cache between jobs. Shake and Make do not enforce
dependency soundness, nor cache correctness when the build
plan itself changes, and this has made it hard/impossible
to do safely. Naively this only helps with stage 1 and not
stage 2, but if we have separate stage 1 and --freeze1 stage
2 builds, both can be incremental. Yes, this is also lossy,
but I only see it leading to false failures, not false
acceptances (if we can also test the stage 1 one), so I
consider it safe. MRs that only work with a slow full build
because of ABI changes can indicate so.
The second, main part is quite hard to tackle, but I strongly
believe incrementality is what we need most, and what we should
remain focused on.
John
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs