On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford <richard.sandif...@arm.com> wrote: >Richard Biener <richard.guent...@gmail.com> writes: >> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford >>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf. >>> Also repeated the performance testing (but haven't yet tried an >>> LTO variant; will do that over the weekend). >> >> Any results? > >Sorry, I should've remembered that finding time to run tests is easy, >finding time to analyse them is hard. > >Speed-wise, the impact of the patch for LTO is similar to without, >with 554.roms_r being the main beneficiary for both AArch64 and x86_64. >I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native >-flto=jobserver. > >Size-wise, there are three tests that grow by >=2% on x86_64: > >549.fotonik3d_r: 5.5% >548.exchange2_r: 29.5% >554.roms_r: 39.6%
Uh. With LTO we might have a reasonable guessed profile and you do have a optimize_loop_nest_for_speed guard on the transform? How does compile time fare with the above benchmarks? >The roms increase seems fair enough given the benefit, since the >whole program uses a similar style for handling arrays. > >fotonik3d is a mixed bag. Some versioning conditions are from >things like: > > subroutine foo(a) > real :: a(:,:,:) > a(1,:,:) = ... > end subroutine > >where we version for the middle dimension having a stride of 1. >This could be eliminated if we had more semantic information. > >Other conditions are for things like: > > subroutine foo(a) > real :: a(:,:,:) > a(:,1,:) = ... > end subroutine > >where the pass really does help, but unfortunately those loops >aren't hot and might not even be used at all. > >Some opportunities are cases in which we avoid gather loads, such as >in the "mp" loads in the hot __material_mod_MOD_mat_updatee function. >But mp feeds a gather load itself and these assignments aren't a >critical part of the loop. > >(I'm testing on an AVX-only system, so these are synthesised gathers >rather than real gathers.) > >The growth in 548.exchange2_r comes from reasonable changes to cold >code. >The test spends almost all its time in __brute_force_MOD_digits_2, >which >contains the same code before and after the patch, but which accounts >for less than 1% of .text before the patch. > >> I've skimmed over the updated patch and it looks >> a lot better now. >> >> +bool >> +loop_versioning >> +::find_per_loop_multiplication (address_info &address, >address_term_info &term) >> +{ >> >> is that what coding convention allows? > >Not sure we've settled on something, so I improvised. > >> For grepping I'd then say we should do >> >> bool loop_versioning:: >> find_per_loop_multiplication (...) >> >> ;) > >Sounds good to me. > >> Anywhere else we you use >> >> loop_versioning::foo >> >> so please stick to that. > >Yeah, I used that when I could avoid an argument spilling over 80 >chars, >but I think I missed that the above function now fits into that >category, >Will double-check the others. > >> Otherwise OK. > >Thanks :-) > >> I think I don't see a testcase where we could version both loops in a >nest >> so I'm not sure whether the transform works fine when you are only >> updating SSA form at the very end of the pass. > >You mean something like: > > real :: foo(:,:), bar(:) > > do i=1,n > do j=1,n > foo(i,j) = ... > end do > bar(i) = .. > end do > >? I can add a test if so. Please. >At the moment the pass only looks for direct versioning opportunities >in inner loops, so the assignment to bar wouldn't be treated as a >versioning opportunity. Ah, OK. We should still hoist the version check >for foo outside both loops, which is tested by loop_versioning_4.f90 >for cases in which the outer loop doesn't have its own array >arithmetic, but isn't tested for cases like the above. > >> There may also be some subtle issues with substitute_and_fold being >> applied to non-up-to-date SSA form given it folds stmts looking at >> (single-use!) SSA edges. The single-use guard might be what saves >you >> here (SSA uses in the copies are not yet updated to point to the >> copied DEFs). > >OK. I was hoping that because we only apply substitute_and_fold >to new code, there would be no problem with uses elsewhere. Might be, yes. >Would it be safer to: > > - version all loops we want to version > - update SSA explicitly > - apply substitute and fold to all "new" loops That would be definitely less fishy. But you need to get at the actual 'new' SSA names for the replacements as I guess they'll be rewritten? Or maybe those are not. >? Could we then get away with returning a 0 TODO at the end? Yes. Richard. >Thanks, >Richard