On Thu, 11 Apr 2019, Jan Hubicka wrote:

> > On Thu, 11 Apr 2019, Jan Hubicka wrote:
> > 
> > > Hi,
> > > the LTO streaming forks for every partition.  With the number of
> > > partitions increased to 128 and the relatively large memory usage
> > > (around 5GB) needed to WPA firefox, this causes the kernel to spend a
> > > lot of time, probably copying the page tables.
> > > 
> > > This patch makes the streamer fork only lto_parallelism times and
> > > stream num_partitions/lto_parallelism partitions in each worker.
> > > I have also added a parameter because currently -flto=jobserver leads
> > > to unlimited parallelism.  This should be fixed by connecting to Make's
> > > jobserver and building our own mini jobserver to distribute partitions
> > > between worker threads, but that seems a bit too involved for a last
> > > minute change in stage4.  I plan to work on this and hopefully backport
> > > it to the .2 release.
> > > 
> > > I have tested the performance on my 32CPU 64threads box and got the
> > > best wall time with a parallelism of 32, and therefore I set that as
> > > the default.  I get
> > > 
> > > --param max-lto-streaming-parallelism=1
> > > Time variable                            usr            sys           wall            GGC
> > >  phase stream out        :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)      921 kB (  0%)
> > >  TOTAL                   : 170.73         33.69        204.64         7459610 kB
> > > 
> > > --param max-lto-streaming-parallelism=4
> > >  phase stream out        :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)      155 kB (  0%)
> > >  TOTAL                   : 130.26         19.68        150.46         7458844 kB
> > > 
> > > --param max-lto-streaming-parallelism=8
> > >  phase stream out        :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)       83 kB (  0%)
> > >  TOTAL                   : 125.28         18.09        143.54         7458773 kB
> > > 
> > > --param max-lto-streaming-parallelism=16
> > >  phase stream out        :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)       35 kB (  0%)
> > >  TOTAL                   : 122.60         17.21        140.56         7458725 kB
> > > 
> > > --param max-lto-streaming-parallelism=32
> > >  phase stream out        :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)       15 kB (  0%)
> > >  TOTAL                   : 118.53         18.36        137.08         7458705 kB
> > > 
> > > --param max-lto-streaming-parallelism=64
> > >  phase stream out        :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)       13 kB (  0%)
> > >  TOTAL                   : 122.17         28.66        151.00         7458702 kB
> > > 
> > > --param max-lto-streaming-parallelism=256
> > >  phase stream out        :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)       13 kB (  0%)
> > >  TOTAL                   : 116.78         22.56        139.53         7458702 kB
> > > 
> > > Note that it is a bit odd that 64 leads to worse results than full
> > > parallelism, but this seems to reproduce relatively well.  Also the
> > > usr/sys times for streaming are not representative since they do not
> > > account for the sys time of the forked workers.  I am not sure where
> > > the fork time is accounted.
> > > 
> > > Generally it seems that the forking performance is not all that bad
> > > and scales reasonably, but I still think we should limit the default
> > > to something less than the 128 we use now.  There are definitely
> > > diminishing returns after increasing beyond 16 or 32, and memory use
> > > goes up noticeably.  With current trunk memory use also does not seem
> > > terribly bad (less global stream streaming makes the workers cheaper),
> > > and in all memory traces I collected it is dominated by the
> > > compilation stage during the full rebuild.
> > > 
> > > I did similar tests for the cc1 binary.  There the relative time spent
> > > in streaming is lower, so it goes from 17% to 1% (for parallelism 1
> > > and 32 respectively).
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Please document the new param in invoke.texi.  Otherwise looks good
> > to me.  Btw, do we actually allocate garbage at write-out time?
> > Thus, would using threads work as well?
> 
> It is on my TODO to get this working.  Last time I checked by adding an
> abort into ggc_alloc there were some occurrences, but I think that can
> be cleaned up.
> 
> I wonder how much of a performance hit we would get from enabling
> pthreads for the lto1 binary and thus building libbackend with it?
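To make the scheme described in the patch mail above concrete, here is a
minimal sketch of forking only lto_parallelism workers and letting each one
stream a contiguous slice of the partitions.  The stream_out_one_partition
helper and the surrounding structure are hypothetical illustrations, not
GCC's actual streaming code:

/* Hypothetical sketch: fork only PARALLELISM workers and let each one
   stream a contiguous slice of the NPART partitions, instead of forking
   once per partition.  stream_out_one_partition is an assumed helper.  */

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void stream_out_one_partition (int part);

static void
stream_out_in_parallel (int npart, int parallelism)
{
  /* Ceiling division: partitions handled per worker.  */
  int chunk = (npart + parallelism - 1) / parallelism;

  for (int w = 0; w < parallelism; w++)
    {
      int first = w * chunk;
      int last = first + chunk < npart ? first + chunk : npart;
      if (first >= last)
        break;

      pid_t pid = fork ();
      if (pid == 0)
        {
          /* Child: stream its slice, then exit without running atexit
             handlers so the parent's state stays untouched.  */
          for (int p = first; p < last; p++)
            stream_out_one_partition (p);
          _exit (0);
        }
      /* A production version would check for fork () failure here.  */
    }

  /* Parent: wait for all workers to finish.  */
  int status;
  while (wait (&status) > 0)
    ;
}

With this slicing, the expensive fork (and the page-table copying it implies
for a multi-gigabyte WPA image) happens only parallelism times instead of
once per partition.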
Is there any performance impact before the first thread creation?
(besides possibly a few well-predicted if (threads_are_running) checks?)

Richard.
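As a rough illustration of the kind of check meant here (using a
hypothetical allocator wrapper, not GCC's actual ggc_alloc), the fast path
taken before any worker threads exist would cost only a single
well-predicted branch:

/* Hypothetical allocator wrapper, not GCC's ggc_alloc: before any worker
   threads are started the flag is false, so the only cost on the fast
   path is one predictable branch.  The flag is set by the main thread
   before it spawns workers.  */

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

extern void *do_alloc (size_t size);   /* assumed underlying allocator */

static bool threads_are_running;
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

void *
guarded_alloc (size_t size)
{
  if (!threads_are_running)
    /* Single-threaded fast path: no locking needed.  */
    return do_alloc (size);

  pthread_mutex_lock (&alloc_lock);
  void *p = do_alloc (size);
  pthread_mutex_unlock (&alloc_lock);
  return p;
}

Whether the locked path, and building libbackend with pthreads at all, costs
anything measurable is exactly the question raised above.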