On Thu, 11 Apr 2019, Jan Hubicka wrote:

> > On Thu, 11 Apr 2019, Jan Hubicka wrote:
> > 
> > > Hi,
> > > the LTO streaming forks for every partition. With the number of
> > > partitions increased to 128 and the relatively large memory usage
> > > (around 5GB) needed to WPA Firefox, this causes the kernel to spend
> > > a lot of time, probably copying the page tables.
> > > 
> > > This patch makes the streamer fork only lto_parallelism times
> > > and stream num_partitions/lto_parallelism partitions in each worker.
> > > I have also added a parameter because currently -flto=jobserver leads
> > > to unlimited parallelism.  This should be fixed by connecting to Make's
> > > jobserver and building our own mini jobserver to distribute partitions
> > > between worker threads, but this seems a bit too involved for a
> > > last-minute change in stage4.  I plan to work on this and hopefully
> > > backport it to the .2 release.
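For illustration, a minimal sketch of the forking scheme described above;
stream_partition () and the contiguous work split are placeholders for this
sketch, not the actual lto/lto.c interfaces:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the real per-partition streaming routine.  */
extern void stream_partition (int partition);

/* Fork only JOBS workers; each child streams a contiguous slice of
   roughly num_partitions / jobs partitions and exits, so the parent's
   page tables are copied JOBS times instead of once per partition.  */
static void
stream_out_partitions (int num_partitions, int jobs)
{
  int chunk = (num_partitions + jobs - 1) / jobs;
  int started = 0;

  for (int worker = 0; worker < jobs; worker++)
    {
      int first = worker * chunk;
      int last = first + chunk < num_partitions ? first + chunk : num_partitions;
      if (first >= last)
        break;

      pid_t pid = fork ();
      if (pid == 0)
        {
          /* Child: stream its slice, then exit without running
             atexit handlers.  */
          for (int i = first; i < last; i++)
            stream_partition (i);
          _exit (0);
        }
      started++;
    }

  /* Parent: wait for all workers to finish.  */
  for (int i = 0; i < started; i++)
    wait (NULL);
}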
> > > 
> > > I have tested the performance on my 32-CPU, 64-thread box and got the
> > > best wall time with 32 partitions, and therefore I set that as the
> > > default.  I get:
> > > 
> > > --param max-lto-streaming-parallelism=1
> > > Time variable                                   usr           sys          wall               GGC
> > >  phase stream out                   :  50.65 ( 30%)  20.66 ( 61%)  71.38 ( 35%)     921 kB (  0%)
> > >  TOTAL                              : 170.73         33.69        204.64        7459610 kB
> > > 
> > > --param max-lto-streaming-parallelism=4
> > >  phase stream out                   :  13.79 ( 11%)   6.80 ( 35%)  20.94 ( 14%)     155 kB (  0%)
> > >  TOTAL                              : 130.26         19.68        150.46        7458844 kB
> > > 
> > > --param max-lto-streaming-parallelism=8
> > >  phase stream out                   :   8.94 (  7%)   5.21 ( 29%)  14.15 ( 10%)      83 kB (  0%)
> > >  TOTAL                              : 125.28         18.09        143.54        7458773 kB
> > > 
> > > --param max-lto-streaming-parallelism=16
> > >  phase stream out                   :   4.56 (  4%)   4.34 ( 25%)   9.46 (  7%)      35 kB (  0%)
> > >  TOTAL                              : 122.60         17.21        140.56        7458725 kB
> > > 
> > > --param max-lto-streaming-parallelism=32
> > >  phase stream out                   :   2.34 (  2%)   5.69 ( 31%)   8.03 (  6%)      15 kB (  0%)
> > >  TOTAL                              : 118.53         18.36        137.08        7458705 kB
> > > 
> > > --param max-lto-streaming-parallelism=64
> > >  phase stream out                   :   1.63 (  1%)  15.76 ( 55%)  17.40 ( 12%)      13 kB (  0%)
> > >  TOTAL                              : 122.17         28.66        151.00        7458702 kB
> > > 
> > > --param max-lto-streaming-parallelism=256
> > >  phase stream out                   :   1.28 (  1%)   9.24 ( 41%)  10.53 (  8%)      13 kB (  0%)
> > >  TOTAL                              : 116.78         22.56        139.53        7458702 kB
> > > 
> > > Note that it is a bit odd that 64 leads to worse results than full
> > > parallelism, but it seems to reproduce relatively well. Also, the
> > > usr/sys times for streaming are not representative, since they do not
> > > account for the sys time of the forked child processes. I am not sure
> > > where the fork time is accounted.
> > > 
> > > Generally it seems that the forking performance is not at all that
> > > bad and scales reasonably, but I still think we should limit the
> > > default to something less than the 128 we use now. Definitely there
> > > are diminishing returns after increasing past 16 or 32, and memory use
> > > goes up noticeably. With current trunk, memory use also does not seem
> > > terribly bad (streaming less to the global stream makes the workers
> > > cheaper), and in all memory traces I collected it is dominated by the
> > > compilation stage during the full rebuild.
> > > 
> > > I did similar tests for the cc1 binary. There the relative time spent
> > > in streaming is lower, so it goes from 17% to 1% (for parallelism 1
> > > and 32 respectively).
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Please document the new param in invoke.texi.  Otherwise looks good
> > to me.  Btw, do we actually allocate garbage at write-out time?
> > Thus, would using threads work as well?
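For comparison, a thread-based variant of the same work split might look
roughly like the sketch below, assuming the write-out path really does no GC
allocation; stream_partition () and the other names are again placeholders,
not existing GCC interfaces:

#include <pthread.h>

/* Placeholder for the real per-partition streaming routine.  */
extern void stream_partition (int partition);

struct stream_job
{
  int first, last;  /* Half-open range of partitions.  */
};

static void *
stream_worker (void *arg)
{
  struct stream_job *job = (struct stream_job *) arg;
  for (int i = job->first; i < job->last; i++)
    stream_partition (i);
  return NULL;
}

/* Same contiguous work split as the fork-based variant, but with
   threads sharing the address space, so nothing needs copying.  */
static void
stream_out_partitions_threaded (int num_partitions, int jobs)
{
  pthread_t tid[jobs];
  struct stream_job work[jobs];
  int chunk = (num_partitions + jobs - 1) / jobs;

  for (int w = 0; w < jobs; w++)
    {
      work[w].first = w * chunk < num_partitions ? w * chunk : num_partitions;
      work[w].last = ((w + 1) * chunk < num_partitions
                      ? (w + 1) * chunk : num_partitions);
      pthread_create (&tid[w], NULL, stream_worker, &work[w]);
    }
  for (int w = 0; w < jobs; w++)
    pthread_join (tid[w], NULL);
}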
> 
> It is on my TODO list to get this working.  Last time I checked, by adding
> an abort into ggc_alloc, there were some occurrences, but I think those can
> be cleaned up.
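Such a check could be a temporary guard in the allocator, roughly along these
lines; the streaming_out_p flag and the simplified signature are invented for
this sketch and do not match the real ggc_alloc interface:

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag, set for the duration of the stream-out phase.  */
static bool streaming_out_p;

/* Placeholder for the real allocation path.  */
extern void *allocate_from_gc (size_t size);

static void *
ggc_alloc_sketch (size_t size)
{
  /* Fail loudly if anything allocates GC memory during write-out, so
     the offending call site can be found and cleaned up.  */
  if (streaming_out_p)
    abort ();
  return allocate_from_gc (size);
}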
> 
> I wonder how much of a performance hit we would get from enabling pthreads
> for the lto1 binary and thus building libbackend with it?

Is there any performance impact before the first thread creation?
(besides possibly a few well-predicted if (threads_are_running) checks?)

Richard.
