Hey Alfredo, Thanks for taking a look at this. I've taken a look at this before and collected some notes here: https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic. If I remember correctly the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic. I've tried implementing a "just append bytes"-mode without `reduceDoc`, but what stopped me was my lack of understanding of different semantics for 3 kinds of empty Doc's. I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.
Cheers, Bartosz 2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <alfredo.dinap...@gmail.com>: > Hey Simon, > > thanks for chiming in. I had the same suspect as well, and in fact my > first temptation was to use dlists to avoid costly concatenation (I guess a > Builder shares pretty much the same idea, which is avoid right-appends). I > eventually bailed out as that Pretty module had some functions like sepX or > fillNBE (I might be wrong, going by memory only here) which needed to > *inspect* the current [Doc] we were carrying around, and thus dlists > couldn’t accomplish this in O(1), but forcing it back to a list was > necessary. Am I right in thinking that using a Builder here will suffer the > same malady? Ideally I would like constant time for both appends, left > concat & inspection (as in pattern-matching inspection) but I don’t think > what I’m asking exists, at least not in its functional declination anyway ;) > > That’s why I was thinking to give Deques (or more generally, Sequences) a > go: they don’t have O(1) appends but at least can be inspected in O(1). > Question has to be answered whether this will yield improvement or not, > which I guess depends upon the ratio of inspection / appending we do in the > pretty printing. In that case using dlists or builders might be better. > Last but not least, although the current implementation doesn’t backtrack > anyway, I’m wondering wether switching to a different representation for a > Doc (namely a huge function composition of Token(s), as described in the > paper) could be beneficial as well. > > Do you guys have any opinions? Yesterday I extracted Pretty.hs from the > sourcetree and I’m now planning to produce a criterion benchmark and > compare different implementations, althought it’s a bit hard to predict the > real world usage as I don’t have access to a representative Doc document as > produced by GHC, so my benchs could be all ill-founded. > > Alfredo > > On 18 April 2017 at 12:01, Simon Marlow <marlo...@gmail.com> wrote: > >> Pretty-printing the asm is a likely contender for optimisation, however >> the problem is not the pretty-printing per se. We don't actually use any >> of the backtracking stuff when printing asm, since there's no point nicely >> indenting things or wrapping lines. The overhead is probably all in the >> layer of data structure that we generate in Pretty before it gets dumped >> into raw bytes. Using a ByteString Builder instead might yield some >> improvement. >> >> Cheers >> Simon >> >> On 17 April 2017 at 18:44, Alfredo Di Napoli <alfredo.dinap...@gmail.com> >> wrote: >> >>> Dear all, >>> >>> after sprinkling (ehm, littering) GHC source code with cost centres, I >>> was not surprised to see that roughly 20% of the compilation time (as in >>> .prof) was spent in the core gen/simplification process (10% of the total >>> time) and on the asm code gen (another 10%). >>> >>> I have almost immediately abandoned the idea of try optimising some >>> modules in simplCore (considering my 0-knowledge of GHC internals anyway..) >>> but I have been dwelling on the following: Outputable.hs and Pretty.hs >>> seems to be have been implemented making deliberate use of lists and >>> concatenations between them, which left me wondering if there was room for >>> optimisation there. I have found this interesting paper on the topic: >>> >>> https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf >>> >>> Now, it’s totally possible that this has been already tried (with no >>> success) but judging from the original copyright of Pretty.hs (dated 2001), >>> it seems it was written prior to the work of Olaf Chitil (the author of the >>> paper). >>> >>> TL;DR I was thinking (even just as a fun exercise to learn more about >>> GHC internals) to leverage the ideas of that paper and switch to a >>> different implementation for `Doc` coupled with the use of lazy dequeues, >>> which *might* increase the performances of the codegen and thus of the >>> compiler overall. Am I fighting a strawman (or flogging a dead horse, pick >>> your rethorical figure :D ) or is there even a tiny chance of this being >>> actually useful? >>> >>> Have a nice evening, >>> >>> Alfredo >>> >>> On 11 April 2017 at 00:47, Ben Gamari <b...@smart-cactus.org> wrote: >>> >>>> Alfredo Di Napoli <alfredo.dinap...@gmail.com> writes: >>>> >>>> > Hey Ben, >>>> > >>>> Hi Alfredo, >>>> >>>> Sorry for the late response! The email queue from the weekend was a bit >>>> longer than I would like. >>>> >>>> > as promised I’m back to you with something more articulated and >>>> hopefully >>>> > meaningful. I do hear you perfectly — probably trying to dive >>>> head-first >>>> > into this without at least a rough understanding of the performance >>>> > hotspots or the GHC overall architecture is going to do me more harm >>>> than >>>> > good (I get the overall picture and I’m aware of the different stages >>>> of >>>> > the GHC compilation pipeline, but it’s far from saying I’m proficient >>>> with >>>> > the architecture as whole). I have also read a couple of years ago >>>> the GHC >>>> > chapter on the “Architeture of Open Source Applications” book, but I >>>> don’t >>>> > know how much that is still relevant. If it is, I guess I should >>>> refresh my >>>> > memory. >>>> > >>>> It sounds like you have done a good amount of reading. That's great. >>>> Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise >>>> it's likely worthwhile diving in. >>>> >>>> > I’m currently trying to move on 2 fronts — please advice if I’m a fool >>>> > flogging a dead horse or if I have any hope of getting anything done >>>> ;) >>>> > >>>> > 1. I’m trying to treat indeed the compiler as a black block (as you >>>> > adviced) trying to build a sufficiently large program where GHC is >>>> not “as >>>> > fast as I would like” (I know that’s a very lame definition of “slow”, >>>> > hehe). In particular, I have built the stage2 compiler with the “prof” >>>> > flavour as you suggested, and I have chosen 2 examples as a reference >>>> > “benchmark” for performance; DynFlags.hs (which seems to have been >>>> > mentioned multiple times as a GHC perf killer) and the >>>> highlighting-kate >>>> > package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 >>>> . >>>> >>>> Indeed, #9221 would be a very interesting ticket to look at. The >>>> highlighting-kate package is interesting in the context of that ticket >>>> as it has a very large amount of parallelism available. >>>> >>>> If you do want to look at #9221, note that the cost centre profiler may >>>> not provide the whole story. In particular, it has been speculated that >>>> the scaling issues may be due to either, >>>> >>>> * threads hitting a blackhole, resulting in blocking >>>> >>>> * the usual scaling limitations of GHC's stop-the-world GC >>>> >>>> The eventlog may be quite useful for characterising these. >>>> >>>> > The idea would be to compile those with -v +RTS -p -hc -RTS enabled, >>>> > look at the output from the .prof file AND the `-v` flag, find any >>>> > hotspot, try to change something, recompile, observe diff, rinse and >>>> > repeat. Do you think I have any hope of making progress this way? In >>>> > particular, I think compiling DynFlags.hs is a bit of a dead-end; I >>>> > whipped up this buggy script which >>>> > escalated into a Behemoth which is compiling pretty much half of the >>>> > compiler once again :D >>>> > >>>> > ``` >>>> > #!/usr/bin/env bash >>>> > >>>> > ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ >>>> > -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler >>>> -I../ghc/compiler/stage2 >>>> > \ >>>> > -I../ghc/compiler/stage2/build \ >>>> > -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compile >>>> r/typecheck:../ghc/compiler/basicTypes >>>> > \ >>>> > -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/comp >>>> iler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude >>>> > \ >>>> > -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../g >>>> hc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn >>>> > \ >>>> > -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compil >>>> er/simplCore:../ghc/compile/specialise >>>> > \ >>>> > -fforce-recomp -c $@ >>>> > ``` >>>> > >>>> > I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` >>>> but >>>> > it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because >>>> it’s >>>> > pulling in half of the compiler anyway :D I tried to reuse the .hi >>>> files >>>> > from my stage2 compilation but I failed (GHC was complaining about >>>> > interface file mismatch). Short story short, I don’t think it will be >>>> a >>>> > very agile way to proceed. Am I right? Do you have any recommendation >>>> in >>>> > such sense? Do I have any hope to compile DynFlags.hs in a way which >>>> would >>>> > make this perf investigation feasible? >>>> > >>>> What I usually do in this case is just take the relevant `ghc` command >>>> line directly from the `make` output and execute it manually. I would >>>> imagine your debug cycle would look something like, >>>> >>>> * instrument the compiler >>>> * build stage1 >>>> * use stage2 to build DynFlags using the stage1 compiler (using a >>>> saved command line) >>>> * think >>>> * repeat >>>> >>>> This should only take a few minutes per iteration. >>>> >>>> > The second example (the highlighting-kate package) seems much more >>>> > promising. It takes maybe 1-2 mins on my machine, which is enough to >>>> take a >>>> > look at the perf output. Do you think I should follow this second >>>> lead? In >>>> > principle any 50+ modules package I think would do (better if with a >>>> lot of >>>> > TH ;) ) but this seems like a low-entry barrier start. >>>> > >>>> > 2. The second path I’m exploring is simply to take a less holistic >>>> approach >>>> > and try to dive in into a performance ticket like the ones listed >>>> here: >>>> > https://www.reddit.com/r/haskell/comments/45q90s/is_anything >>>> _being_done_to_remedy_the_soul/czzq6an/ >>>> > Maybe some are very specific, but it seems like fixing small things >>>> and >>>> > move forward could help giving me understanding of different >>>> sub-parts of >>>> > GHC, which seems less intimidating than the black-box approach. >>>> > >>>> Do you have any specific tickets from these lists that you found >>>> interesting? >>>> >>>> > In conclusion, what do you think is the best approach, 1 or 2, both or >>>> > none? ;) >>>> >>>> I would say that it largely depends upon what you feel most comfortable >>>> with. If you feel up for it, I think #9221 would be a nice, fairly >>>> self-contained, yet high-impact ticket which would be worth spending a >>>> few days diving further into. >>>> >>>> Cheers, >>>> >>>> - Ben >>>> >>>> >>> >>> _______________________________________________ >>> ghc-devs mailing list >>> ghc-devs@haskell.org >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs >>> >>> >> > > _______________________________________________ > ghc-devs mailing list > ghc-devs@haskell.org > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs > >
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs