Hi Magnus, your winning variant gives me a nice boost on my thinkpad:
pch, standard:
real 17m52.367s
user 52m20.730s
sys 4m53.711s

pch, your variant:
real 15m0.514s
user 46m6.466s
sys 2m38.371s

(non-pch is ~19-20 minutes wall-clock time)

With those numbers, I might start using pch again on low-powered machines.

..Thomas

On Fri, Nov 2, 2018 at 12:14 PM Magnus Ihse Bursie
<[email protected]> wrote:
>
>
> On 2018-11-02 11:39, Magnus Ihse Bursie wrote:
> > On 2018-11-02 00:53, Ioi Lam wrote:
> >> Maybe precompiled.hpp can be periodically (weekly?) updated by a
> >> robot, which parses the dependency files generated by gcc and picks
> >> the most popular N files?
> > I think that's tricky to implement automatically. However, I've done
> > more or less that, and I've got some wonderful results! :-)
>
> Ok, I'm done running my tests.
>
> TL;DR: I've managed to reduce wall-clock time from 2m 45s (with pch) or
> 2m 23s (without pch) to 1m 55s. The cpu time spent went from 52m 27s
> (with pch) or 55m 30s (without pch) to 41m 10s. This is a huge gain for
> our automated builds, and a clear improvement even for the ordinary
> developer.
>
> The list of included header files is reduced to just 37. The winning
> combination was to include all header files that were included by more
> than 130 different files, but to exclude all files named
> "*.inline.hpp". A hoped-for further gain of not pulling in the
> *.inline.hpp files is that the risk of pch/non-pch failures will
> diminish.
>
> However, these 37 files in turn pull in an additional 201 header files.
> Of these, three are *.inline.hpp:
> share/jfr/recorder/checkpoint/types/traceid/jfrTraceIdBits.inline.hpp,
> os_cpu/linux_x86/bytes_linux_x86.inline.hpp and
> os_cpu/linux_x86/copy_linux_x86.inline.hpp. This looks like a problem
> with the header files to me.
>
> With some exceptions (mostly related to JFR), these additional files
> have "generic"-looking names (like share/gc/g1/g1_globals.hpp), which
> suggests to me that it is reasonable to have them in this list, just as
> the original 37 tended to be quite general and high-level includes.
> However, some files (like
> share/jfr/instrumentation/jfrEventClassTransformer.hpp) have maybe
> leaked in where they should not really be. It might be worth letting a
> hotspot engineer spend some cycles checking these files to see if
> anything can be improved.
>
> Caveats: I have only run this on my local linux build with the default
> server JVM configuration. Other machines will have different sweet
> spots. Other JVM variants/feature combinations will have different
> sweet spots. And, most importantly, I have not tested this at all on
> Windows. Nevertheless, because of the speed improvements I measured,
> I'm almost prepared to suggest a patch that uses this selection of
> files as-is when running on gcc.
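>
> For anyone who wants to repeat the selection on their own tree, the
> popularity count can be approximated with a rough shell sketch along
> these lines (assuming the *.d dependency files that gcc leaves in the
> build directory; the path and cutoff are illustrative, and this is not
> the exact script I used):
>
> find build -name '*.d' -exec cat {} + \
>   | tr ' \\' '\n\n' | grep '\.hpp$' | grep -v '\.inline\.hpp$' \
>   | sort | uniq -c | sort -rn | awk '$1 > 130 { print $2 }'
>
> Each *.d file lists the headers one object file depends on, so counting
> how often a header appears across all of them gives the "included by
> more than N files" ranking described above.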
>
> And some data:
>
> Here is my log from my runs. The "on or above" figure is the cutoff I
> used: how many files needed to include a header for it to be selected.
> As you can see, there is not much difference between cutoffs from 130
> to 150, or (without the inline files) between 110 and 150. (There were
> a lot of additional inline files in the positions below 130.) All else
> being equal, I'd prefer a solution with fewer files; it is less likely
> to go bad.
>
> real 2m45.623s
> user 52m27.813s
> sys 5m27.176s
> hotspot with original pch
>
> real 2m23.837s
> user 55m30.448s
> sys 3m39.739s
> hotspot without pch
>
> real 1m59.533s
> user 42m50.019s
> sys 3m0.893s
> hotspot new pch on or above 250
>
> real 1m58.937s
> user 42m18.994s
> sys 3m0.245s
> hotspot new pch on or above 200
>
> real 2m0.729s
> user 42m16.636s
> sys 2m57.125s
> hotspot new pch on or above 170
>
> real 1m58.064s
> user 42m9.618s
> sys 2m57.635s
> hotspot new pch on or above 150
>
> real 1m58.053s
> user 42m9.796s
> sys 2m58.732s
> hotspot new pch on or above 130
>
> real 2m3.364s
> user 42m54.818s
> sys 3m2.737s
> hotspot new pch on or above 100
>
> real 2m6.698s
> user 44m30.434s
> sys 3m12.015s
> hotspot new pch on or above 70
>
> real 2m0.598s
> user 41m17.810s
> sys 2m56.258s
> hotspot new pch on or above 150 without inline
>
> real 1m55.981s
> user 41m10.076s
> sys 2m51.983s
> hotspot new pch on or above 130 without inline
>
> real 1m56.449s
> user 41m10.667s
> sys 2m53.808s
> hotspot new pch on or above 110 without inline
>
> And here is the "winning" list (which I declared as "on or above 130,
> without inline"). I encourage everyone to try this on their own system
> and report back the results!
>
> #ifndef DONT_USE_PRECOMPILED_HEADER
> # include "classfile/classLoaderData.hpp"
> # include "classfile/javaClasses.hpp"
> # include "classfile/systemDictionary.hpp"
> # include "gc/shared/collectedHeap.hpp"
> # include "gc/shared/gcCause.hpp"
> # include "logging/log.hpp"
> # include "memory/allocation.hpp"
> # include "memory/iterator.hpp"
> # include "memory/memRegion.hpp"
> # include "memory/resourceArea.hpp"
> # include "memory/universe.hpp"
> # include "oops/instanceKlass.hpp"
> # include "oops/klass.hpp"
> # include "oops/method.hpp"
> # include "oops/objArrayKlass.hpp"
> # include "oops/objArrayOop.hpp"
> # include "oops/oop.hpp"
> # include "oops/oopsHierarchy.hpp"
> # include "runtime/atomic.hpp"
> # include "runtime/globals.hpp"
> # include "runtime/handles.hpp"
> # include "runtime/mutex.hpp"
> # include "runtime/orderAccess.hpp"
> # include "runtime/os.hpp"
> # include "runtime/thread.hpp"
> # include "runtime/timer.hpp"
> # include "services/memTracker.hpp"
> # include "utilities/align.hpp"
> # include "utilities/bitMap.hpp"
> # include "utilities/copy.hpp"
> # include "utilities/debug.hpp"
> # include "utilities/exceptions.hpp"
> # include "utilities/globalDefinitions.hpp"
> # include "utilities/growableArray.hpp"
> # include "utilities/macros.hpp"
> # include "utilities/ostream.hpp"
> # include "utilities/ticks.hpp"
> #endif // !DONT_USE_PRECOMPILED_HEADER
>
> /Magnus
>
>
> > I'd still like to run some more tests, but preliminary data indicates
> > that there is much to be gained by having a more sensible list of
> > files in the precompiled header.
> >
> > The fewer files we have on this list, the less likely it is to become
> > (drastically) outdated. So I don't think we need to do this
> > automatically, but perhaps manually every now and then, when we feel
> > build times are increasing.
> >
> > /Magnus
> >
> >>
> >> - Ioi
> >>
> >>
> >> On 11/1/18 4:38 PM, David Holmes wrote:
> >>> It's not at all obvious to me that the way we use PCH is the
> >>> right/best way to use it. We dump every header we think it would be
> >>> good to precompile into precompiled.hpp and then ask gcc to
> >>> precompile only that. The result is a ~250MB file that has to be
> >>> read in and processed for every source file! That doesn't seem very
> >>> efficient to me.
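> >>>
> >>> (Concretely, the gcc flow is roughly the following -- one serial
> >>> step builds the .gch, and every subsequent compile re-reads it.
> >>> These flags are only illustrative, not our exact build invocation:
> >>>
> >>> g++ -x c++-header precompiled.hpp -o precompiled.hpp.gch
> >>> g++ -include precompiled.hpp -c someHotspotFile.cpp
> >>>
> >>> gcc picks up precompiled.hpp.gch automatically whenever
> >>> precompiled.hpp is included.)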
> >>>
> >>> Cheers,
> >>> David
> >>>
> >>> On 2/11/2018 3:18 AM, Erik Joelsson wrote:
> >>>> Hello,
> >>>>
> >>>> My point here, which wasn't very clear, is that Mac and Linux seem
> >>>> to lose just as much real compile time. The big difference in
> >>>> these tests was rather the number of cpus in the machine (32
> >>>> threads in the linux box vs 8 on the mac). The total amount of
> >>>> work done increased when PCH was disabled; that's the user time.
> >>>> Here is my theory on how the inconsistency between real (wall
> >>>> clock) time and user time across these experiments can be
> >>>> explained.
> >>>>
> >>>> With pch, the time line (simplified) looks like this:
> >>>>
> >>>> 1. Single thread creating PCH
> >>>> 2. All cores compiling C++ files
> >>>>
> >>>> When disabling pch, it's just:
> >>>>
> >>>> 1. All cores compiling C++ files
> >>>>
> >>>> To gain speed with PCH, the time spent in 1 must be less than the
> >>>> time saved in 2, and the potential time saved in 2 goes down as
> >>>> the number of cpus goes up (see the back-of-envelope numbers at
> >>>> the end of this mail). I'm pretty sure that if I repeated the
> >>>> experiment on Linux on a smaller box (typically one we use in CI),
> >>>> the results would look similar to Macosx, and similarly, if I had
> >>>> access to a much bigger mac, it would behave like the big Linux
> >>>> box. This is why I'm saying this should be done for both of these
> >>>> platforms or neither.
> >>>>
> >>>> In addition to this, the experiment only built hotspot. If we
> >>>> instead built the whole JDK, the time wasted in 1 in the PCH case
> >>>> would be negated to a large extent by other build targets running
> >>>> concurrently, so for a full build, PCH is still providing value.
> >>>>
> >>>> The question here is: if the value of PCH isn't very big, perhaps
> >>>> it's not worth it, given that it's also creating as much grief as
> >>>> described here. There is no doubt that there is value, however.
> >>>> And given the examination done by Magnus, it seems this value
> >>>> could be increased.
> >>>>
> >>>> The main reason why we haven't disabled PCH in CI before is that
> >>>> we really, really want CI builds to be fast, and we don't have a
> >>>> ton of spare capacity to just throw at it. PCH made builds faster,
> >>>> so we used them. My other reason is consistency between builds.
> >>>> Supporting multiple different modes of building creates the
> >>>> potential for inconsistencies. For that reason I would definitely
> >>>> not support having PCH on by default but turned off in our
> >>>> CI/dev-submit. We pick one or the other as the official build
> >>>> configuration, and we stick with the official build configuration
> >>>> for all builds of any official capacity (which includes CI).
> >>>>
> >>>> In the current CI setup, we have a bunch of tiers that execute one
> >>>> after the other. jdk-submit currently only runs tier1. In tier2
> >>>> I've put slowdebug builds with PCH disabled, just to help verify a
> >>>> common developer configuration. These builds are not meant to be
> >>>> used for testing or anything like that; they are just run for
> >>>> verification, which is why this is ok. We could argue that it
> >>>> would make sense to move the linux-x64-slowdebug-without-pch build
> >>>> to tier1 so that it's included in dev-submit.
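> >>>>
> >>>> Back-of-envelope, as a sketch (using the linux-x64 numbers quoted
> >>>> further down in this thread, and assuming all 32 threads are kept
> >>>> busy):
> >>>>
> >>>>   user time saved by pch:  66m12s - 61m23s ~= 4m49s
> >>>>   spread over 32 threads:  4m49s / 32      ~= 9s of wall clock
> >>>>
> >>>> So on that box, pch only pays off if the serial step 1 costs less
> >>>> than roughly 9 seconds of wall clock, which it clearly does not --
> >>>> consistent with the measured real times (4m07s with pch vs 3m41s
> >>>> without).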
> >>>>
> >>>> /Erik
> >>>>
> >>>> On 2018-11-01 03:38, Magnus Ihse Bursie wrote:
> >>>>>
> >>>>>
> >>>>> On 2018-10-31 00:54, Erik Joelsson wrote:
> >>>>>> Below are the corresponding numbers from a Mac (Mac Pro (Late
> >>>>>> 2013), 3.7 GHz Quad-Core Intel Xeon E5, 16 GB). To be clear,
> >>>>>> -npch means without precompiled headers. Here we see a slight
> >>>>>> degradation when disabling them, in both user time and wall
> >>>>>> clock time. My guess is that the user time increase is about the
> >>>>>> same, but because of a lower cpu count, the extra load is not as
> >>>>>> easily covered.
> >>>>>>
> >>>>>> These tests were run with just building hotspot. This means that
> >>>>>> the precompiled header is generated alone on one core while
> >>>>>> nothing else is happening, which would explain this degradation
> >>>>>> in build speed. If we were instead building the whole product,
> >>>>>> we would see a better correlation between user and real time.
> >>>>>>
> >>>>>> Given the very small benefit here, it could make sense to
> >>>>>> disable precompiled headers by default for Linux and Mac, just
> >>>>>> as we did with ccache.
> >>>>>>
> >>>>>> I do know that the benefit is huge on Windows, though, so we
> >>>>>> cannot remove the feature completely. Any other comments?
> >>>>>
> >>>>> Well, if you show that it is a loss of time on macosx to disable
> >>>>> precompiled headers, and no-one (as far as I've seen) has
> >>>>> complained about PCH on mac, then why not keep them on as default
> >>>>> there? That the gain is small is no argument for losing it. (I
> >>>>> remember a time when you were hunting seconds in the build
> >>>>> time ;-))
> >>>>>
> >>>>> On linux, the story seems different, though. People experience
> >>>>> PCH as a problem, and there is a net loss of time, at least on
> >>>>> selected testing machines. It makes sense to turn it off by
> >>>>> default there, then.
> >>>>>
> >>>>> /Magnus
> >>>>>
> >>>>>>
> >>>>>> /Erik
> >>>>>>
> >>>>>> macosx-x64
> >>>>>> real 4m13.658s
> >>>>>> user 27m17.595s
> >>>>>> sys 2m11.306s
> >>>>>>
> >>>>>> macosx-x64-npch
> >>>>>> real 4m27.823s
> >>>>>> user 30m0.434s
> >>>>>> sys 2m18.669s
> >>>>>>
> >>>>>> macosx-x64-debug
> >>>>>> real 5m21.032s
> >>>>>> user 35m57.347s
> >>>>>> sys 2m20.588s
> >>>>>>
> >>>>>> macosx-x64-debug-npch
> >>>>>> real 5m33.728s
> >>>>>> user 38m10.311s
> >>>>>> sys 2m27.587s
> >>>>>>
> >>>>>> macosx-x64-slowdebug
> >>>>>> real 3m54.439s
> >>>>>> user 25m32.197s
> >>>>>> sys 2m8.750s
> >>>>>>
> >>>>>> macosx-x64-slowdebug-npch
> >>>>>> real 4m11.987s
> >>>>>> user 27m59.857s
> >>>>>> sys 2m18.093s
> >>>>>>
> >>>>>>
> >>>>>> On 2018-10-30 14:00, Erik Joelsson wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> On 2018-10-30 13:17, Aleksey Shipilev wrote:
> >>>>>>>> On 10/30/2018 06:26 PM, Ioi Lam wrote:
> >>>>>>>>> Is there any advantage of using precompiled headers on Linux?
> >>>>>>>> I have measured it recently on the shenandoah repositories,
> >>>>>>>> and fastdebug/release build times have not improved with or
> >>>>>>>> without PCH. Actually, it gets worse when you touch a single
> >>>>>>>> header that is in the PCH list, and you end up recompiling all
> >>>>>>>> of Hotspot. I would be in favor of disabling it by default.
> >>>>>>> I just did a measurement on my local workstation (2x8 cores x2
> >>>>>>> ht, Ubuntu 18.04, using Oracle devkit GCC 7.3.0). I ran "time
> >>>>>>> make hotspot" with clean build directories.
> >>>>>>>
> >>>>>>> linux-x64:
> >>>>>>> real 4m6.657s
> >>>>>>> user 61m23.090s
> >>>>>>> sys 6m24.477s
> >>>>>>>
> >>>>>>> linux-x64-npch
> >>>>>>> real 3m41.130s
> >>>>>>> user 66m11.824s
> >>>>>>> sys 4m19.224s
> >>>>>>>
> >>>>>>> linux-x64-debug
> >>>>>>> real 4m47.117s
> >>>>>>> user 75m53.740s
> >>>>>>> sys 8m21.408s
> >>>>>>>
> >>>>>>> linux-x64-debug-npch
> >>>>>>> real 4m42.877s
> >>>>>>> user 84m30.764s
> >>>>>>> sys 4m54.666s
> >>>>>>>
> >>>>>>> linux-x64-slowdebug
> >>>>>>> real 3m54.564s
> >>>>>>> user 44m2.828s
> >>>>>>> sys 6m22.785s
> >>>>>>>
> >>>>>>> linux-x64-slowdebug-npch
> >>>>>>> real 3m23.092s
> >>>>>>> user 55m3.142s
> >>>>>>> sys 4m10.172s
> >>>>>>>
> >>>>>>> These numbers support your claim. Wall clock time actually
> >>>>>>> increases with PCH enabled, even though total user time
> >>>>>>> decreases. Does not seem worth it to me.
> >>>>>>>>> It's on by default, and we keep having breakage where someone
> >>>>>>>>> forgets to add an #include. The latest instance is
> >>>>>>>>> JDK-8213148.
> >>>>>>>> Yes, we catch most of these breakages in CIs. Which tells me
> >>>>>>>> adding it to jdk-submit would cover most of the breakage
> >>>>>>>> during pre-integration testing.
> >>>>>>> jdk-submit currently runs what we call "tier1". We do have
> >>>>>>> builds of Linux slowdebug with precompiled headers disabled in
> >>>>>>> tier2. We also build solaris-sparcv9 in tier1, which does not
> >>>>>>> support precompiled headers at all, so to not be caught by
> >>>>>>> jdk-submit you would have to be in Linux-specific code. The
> >>>>>>> example bug does not seem to be that. Mach5/jdk-submit was down
> >>>>>>> over the weekend and yesterday, so my suspicion is that the
> >>>>>>> offending code in this case was never tested.
> >>>>>>>
> >>>>>>> That said, given that we get practically no benefit from PCH on
> >>>>>>> Linux/GCC, we should probably just turn it off by default for
> >>>>>>> Linux and/or GCC. I think we need to investigate Macos as well
> >>>>>>> here.
> >>>>>>>
> >>>>>>> /Erik
> >>>>>>>> -Aleksey
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >
>
