Hey Bartosz, thanks for that, looks like I won the jackpot today ;)
Sounds like your notes are an excellent starting point. I will try to get a better understanding of that module myself, to see whether we can reap some perf win here, but the fact that you have "been there before" suggests this is a promising path to take.

Cheers!

Alfredo

On 18 April 2017 at 13:01, Bartosz Nitka <nite...@gmail.com> wrote:

> Hey Alfredo,
>
> Thanks for taking a look at this. I've looked into this before and collected some notes here: https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems
> Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic. If I remember correctly, the core of the problem was that even in the "fast" mode the pretty-printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic.
>
> I've tried implementing a "just append bytes" mode without `reduceDoc`, but what stopped me was my lack of understanding of the different semantics of the three kinds of empty Docs. I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.
>
> Cheers,
> Bartosz
>
> 2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <alfredo.dinap...@gmail.com>:
>
>> Hey Simon,
>>
>> thanks for chiming in. I had the same suspicion, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, namely avoiding the cost of nested appends). I eventually bailed out because the Pretty module has some functions like sepX or fillNBE (I might be wrong, going by memory only here) which need to *inspect* the current [Doc] being carried around, and dlists cannot do that in O(1): forcing them back to a plain list is necessary. Am I right in thinking that a Builder here would suffer the same malady? Ideally I would like constant time for appends, left concatenation and inspection (as in pattern-matching inspection), but I don't think what I'm asking for exists, at least not in a purely functional setting anyway ;)
>>
>> That's why I was thinking of giving Deques (or, more generally, Sequences) a go: they don't have O(1) appends, but they can at least be inspected in O(1). Whether this yields an improvement remains to be seen, and I guess it depends on the ratio of inspection to appending we do in the pretty-printing; if appending dominates, dlists or Builders might be better. Last but not least, although the current implementation doesn't backtrack anyway, I'm wondering whether switching to a different representation for a Doc (namely a huge function composition of Tokens, as described in the paper) could be beneficial as well.
>>
>> Do you guys have any opinions? Yesterday I extracted Pretty.hs from the source tree and I'm now planning to produce a criterion benchmark and compare different implementations, although it's a bit hard to predict real-world usage, as I don't have access to a representative Doc as produced by GHC, so my benchmarks could be ill-founded.
>>
>> Alfredo
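To make the trade-off Alfredo describes concrete, here is a minimal sketch of the two candidate representations; the names (`DDoc`, `SeqDoc`, and the helpers) are invented for illustration and are not GHC's. A difference list gives O(1) append but has to be flattened before it can be pattern-matched, whereas `Data.Sequence` can be inspected at either end in amortised O(1) at the cost of O(log(min(n, m))) concatenation (complexities as documented for the containers package):

```
module DocSketch where

import Data.Sequence (Seq, ViewL (..), viewl, (><))

-- Difference-list representation: a Doc is a function that prepends
-- its tokens to whatever comes after it.
newtype DDoc tok = DDoc ([tok] -> [tok])

emptyD :: DDoc tok
emptyD = DDoc id

appendD :: DDoc tok -> DDoc tok -> DDoc tok
appendD (DDoc f) (DDoc g) = DDoc (f . g)          -- O(1)

-- Pattern-matching on the head forces the whole spine: O(n).
unconsD :: DDoc tok -> Maybe (tok, [tok])
unconsD (DDoc f) = case f [] of
  []       -> Nothing
  tok : ts -> Just (tok, ts)

-- Sequence representation: inspection at either end is cheap.
type SeqDoc tok = Seq tok

appendS :: SeqDoc tok -> SeqDoc tok -> SeqDoc tok
appendS = (><)                                    -- O(log(min(n, m)))

unconsS :: SeqDoc tok -> Maybe (tok, SeqDoc tok)
unconsS s = case viewl s of                       -- amortised O(1)
  EmptyL    -> Nothing
  tok :< ts -> Just (tok, ts)
```

Which of the two wins for Pretty.hs is exactly the inspect-versus-append ratio question raised above, which the criterion benchmark Alfredo mentions should be able to settle.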
>> On 18 April 2017 at 12:01, Simon Marlow <marlo...@gmail.com> wrote:
>>
>>> Pretty-printing the asm is a likely contender for optimisation; however, the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. Using a ByteString Builder instead might yield some improvement.
>>>
>>> Cheers
>>> Simon
>>>
>>> On 17 April 2017 at 18:44, Alfredo Di Napoli <alfredo.dinap...@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).
>>>>
>>>> I almost immediately abandoned the idea of trying to optimise some modules in simplCore (considering my zero knowledge of GHC internals anyway..), but I have been dwelling on the following: Outputable.hs and Pretty.hs seem to have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:
>>>>
>>>> https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf
>>>>
>>>> Now, it's entirely possible that this has already been tried (with no success), but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).
>>>>
>>>> TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) of leveraging the ideas of that paper and switching to a different implementation of `Doc`, coupled with the use of lazy deques, which *might* improve the performance of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rhetorical figure :D ), or is there even a tiny chance of this being actually useful?
>>>>
>>>> Have a nice evening,
>>>>
>>>> Alfredo
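Putting together Bartosz's "just append bytes" mode, Simon's ByteString Builder suggestion and the token-based representation from the paper above, a self-contained sketch of how the pieces could fit might look as follows. None of this is GHC's actual Pretty/Outputable code; every name is invented for illustration, and it deliberately ignores the three kinds of empty Doc and all layout combinators:

```
module FastDoc where

import Data.ByteString.Builder (Builder, charUtf8, hPutBuilder, stringUtf8)
import System.IO (Handle, stdout)

-- The smallest units we ever emit.
data Token = Str String | Ch Char | Newline

-- A Doc is a function that prepends its tokens, in the spirit of the
-- paper: composition is O(1), and nothing on this path ever measures
-- the width of a sub-document (there is no analogue of reduceDoc).
newtype Doc = Doc ([Token] -> [Token])

instance Semigroup Doc where
  Doc f <> Doc g = Doc (f . g)

instance Monoid Doc where
  mempty  = Doc id
  mappend = (<>)

text :: String -> Doc
text s = Doc (Str s :)

char :: Char -> Doc
char c = Doc (Ch c :)

newline :: Doc
newline = Doc (Newline :)

-- "Just append bytes": flatten to tokens, turn each token into a
-- ByteString Builder chunk, and write the lot to the handle.
render :: Handle -> Doc -> IO ()
render h (Doc f) = hPutBuilder h (foldMap toBuilder (f []))
  where
    toBuilder :: Token -> Builder
    toBuilder (Str s) = stringUtf8 s
    toBuilder (Ch c)  = charUtf8 c
    toBuilder Newline = charUtf8 '\n'

-- A toy use, emitting an asm-like line.
demo :: IO ()
demo = render stdout (text "movq" <> char ' ' <> text "%rax, %rbx" <> newline)
```

The only point of the sketch is that no sub-document width is ever computed on this path, so the accidentally quadratic behaviour around `reduceDoc` cannot arise; whether such a fast path can be reconciled with the semantics of the existing empty Docs is precisely the open question from Bartosz's notes.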
>>>> On 11 April 2017 at 00:47, Ben Gamari <b...@smart-cactus.org> wrote:
>>>>
>>>>> Alfredo Di Napoli <alfredo.dinap...@gmail.com> writes:
>>>>>
>>>>> > Hey Ben,
>>>>>
>>>>> Hi Alfredo,
>>>>>
>>>>> Sorry for the late response! The email queue from the weekend was a bit longer than I would like.
>>>>>
>>>>> > as promised I'm back to you with something more articulated and hopefully meaningful. I hear you perfectly: trying to dive head-first into this without at least a rough understanding of the performance hotspots or of GHC's overall architecture is probably going to do me more harm than good (I get the overall picture and I'm aware of the different stages of the GHC compilation pipeline, but that's far from saying I'm proficient with the architecture as a whole). A couple of years ago I also read the GHC chapter of the "Architecture of Open Source Applications" book, but I don't know how much of it is still relevant. If it is, I guess I should refresh my memory.
>>>>>
>>>>> It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.
>>>>>
>>>>> > I'm currently trying to move on two fronts; please advise if I'm a fool flogging a dead horse or if I have any hope of getting anything done ;)
>>>>>
>>>>> > 1. I'm trying to treat the compiler as a black box (as you advised), building a sufficiently large program where GHC is not "as fast as I would like" (I know that's a very lame definition of "slow", hehe). In particular, I have built the stage2 compiler with the "prof" flavour as you suggested, and I have chosen two examples as a reference "benchmark" for performance: DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221
>>>>>
>>>>> Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.
>>>>>
>>>>> If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either
>>>>>
>>>>> * threads hitting a blackhole, resulting in blocking
>>>>> * the usual scaling limitations of GHC's stop-the-world GC
>>>>>
>>>>> The eventlog may be quite useful for characterising these.
>>>>>
>>>>> > The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe the diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead end; I whipped up this buggy script, which escalated into a behemoth that compiles pretty much half of the compiler once again :D
>>>>>
>>>>> > ```
>>>>> > #!/usr/bin/env bash
>>>>> >
>>>>> > ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
>>>>> >   -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \
>>>>> >   -I../ghc/compiler/stage2/build \
>>>>> >   -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \
>>>>> >   -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \
>>>>> >   -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \
>>>>> >   -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compile/specialise \
>>>>> >   -fforce-recomp -c $@
>>>>> > ```
>>>>>
>>>>> > I'm running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs`, but it's taking a long time to compile (20+ mins on my 2014 Mac Pro) because it's pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about an interface file mismatch). Long story short, I don't think it will be a very agile way to proceed. Am I right? Do you have any recommendations here? Do I have any hope of compiling DynFlags.hs in a way that makes this perf investigation feasible?
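As a footnote to the cost-centre workflow being discussed here: the "sprinkling" amounts to `{-# SCC #-}` pragmas on expressions plus a profiled build. A minimal sketch with entirely invented function names (this is not GHC source):

```
module CostCentreSketch where

-- The label "prettyAsm" becomes its own row in the .prof report when
-- the program is compiled with -prof and run with +RTS -p.
prettyAsm :: [String] -> String
prettyAsm instrs = {-# SCC "prettyAsm" #-} unlines instrs

-- Expressions can be annotated individually to split one function's cost.
codegen :: [String] -> String
codegen instrs =
  let header = {-# SCC "codegen_header" #-} "# generated\n"
      body   = {-# SCC "codegen_body" #-}   concatMap (++ "\n") instrs
  in header ++ body

main :: IO ()
main = putStr (codegen ["movq %rax, %rbx", "ret"])
```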
>>>>> What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like:
>>>>>
>>>>> * instrument the compiler
>>>>> * build stage1
>>>>> * use stage2 to build DynFlags using the stage1 compiler (using a saved command line)
>>>>> * think
>>>>> * repeat
>>>>>
>>>>> This should only take a few minutes per iteration.
>>>>>
>>>>> > The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle I think any 50+ module package would do (better if it has a lot of TH ;) ), but this seems like a low-barrier place to start.
>>>>>
>>>>> > 2. The second path I'm exploring is simply to take a less holistic approach and try to dive into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
>>>>> > Maybe some are very specific, but it seems like fixing small things and moving forward could help give me an understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.
>>>>>
>>>>> Do you have any specific tickets from these lists that you found interesting?
>>>>>
>>>>> > In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)
>>>>>
>>>>> I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> - Ben
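Finally, since the eventlog is singled out above as the right tool for the blackhole-blocking and stop-the-world-GC questions behind #9221, here is a minimal sketch of user-level eventlog instrumentation. The program and its names are invented; only the `Debug.Trace` calls and the build/run flags are real:

```
module EventlogSketch where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, forM_)
import Debug.Trace (traceEventIO, traceMarkerIO)

-- A stand-in for "compile one module"; the traceEventIO calls bracket
-- the work so the span shows up next to the GC and scheduler events.
compileModule :: String -> IO ()
compileModule name = do
  traceEventIO ("START compile " ++ name)
  let n = length (show [1 .. 200000 :: Int])   -- some busy work
  n `seq` traceEventIO ("STOP compile " ++ name)

main :: IO ()
main = do
  traceMarkerIO "begin parallel build"
  dones <- forM ["A", "B", "C", "D"] $ \m -> do
    done <- newEmptyMVar
    _ <- forkIO (compileModule m >> putMVar done ())
    pure done
  forM_ dones takeMVar
  traceMarkerIO "end parallel build"

-- Build: ghc -threaded -eventlog -rtsopts EventlogSketch.hs
-- Run:   ./EventlogSketch +RTS -N -l
-- and then open EventlogSketch.eventlog in ThreadScope.
```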