> > With part suffixes we also may want to merge specially, since the
> > entry_count of the split part does not correspond to entry_count of the
> > original function.
> >
> > I wonder, does partitioned function work with the google tool?  I
> > remember it had limitations in this respect.
> >
>
> Yes, Here are some examples.
>
> _Z17expand_assignmentP9tree_nodeS0_b.part.0 total:7045 head:297
> 0: 297
> 20: 297
>
> _Z17expand_assignmentP9tree_nodeS0_b total:1488 head:277
> 1: 277
> 9: 277
> Here, we should keep the head as it is as head is for offset 1.
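
A minimal sketch of the head-count rule discussed in this thread, using the
two expand_assignment entries quoted above. The FunctionProfile class and the
helper names are hypothetical and are not taken from create_gcov or GCC; only
the total and head fields are modelled, not the per-offset counts. Summing the
totals is an assumption about what "merging" means here.

```python
import re
from dataclasses import dataclass


@dataclass
class FunctionProfile:
    """Hypothetical in-memory form of one entry of the text dump."""
    name: str
    total: int
    head: int


# Matches a trailing ".part.N" suffix added by function splitting.
_PART_SUFFIX = re.compile(r"\.part\.\d+$")


def strip_part_suffix(name):
    """Return (base_name, was_stripped) for a possibly split function name."""
    base = _PART_SUFFIX.sub("", name)
    return base, base != name


def merge_profiles(entries):
    """Merge entries that share a base name.

    Totals are summed (an assumption), but the head count of an entry whose
    .part suffix was stripped is not merged in.
    """
    merged = {}
    for entry in entries:
        base, stripped = strip_part_suffix(entry.name)
        acc = merged.setdefault(base, FunctionProfile(base, 0, 0))
        acc.total += entry.total
        if not stripped:
            acc.head += entry.head
    return merged


entries = [
    FunctionProfile("_Z17expand_assignmentP9tree_nodeS0_b.part.0", 7045, 297),
    FunctionProfile("_Z17expand_assignmentP9tree_nodeS0_b", 1488, 277),
]
merged = merge_profiles(entries)
for profile in merged.values():
    print(profile.name, "total:%d" % profile.total, "head:%d" % profile.head)
```

With the two quoted entries this keeps head at 277 while the total becomes
8533. The sketch does nothing about the inconsistency between the .part
counts and the counts attributed to inline instances.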

I actually had in mind the .cold partition (-freorder-blocks-and-partition),
but this is interesting too.  We should track whether we stripped a .part
suffix and, in that case, not merge in the head counts.

However, the number of invocations of
_Z17expand_assignmentP9tree_nodeS0_b.part.0 should always be strictly lower
than the number of invocations of _Z17expand_assignmentP9tree_nodeS0_b.  This
is not reflected by the head counts, since I suppose the second is inlined
into some contexts, which makes its execution be accounted separately to the
inline instances.  So merging the profiles will also lead to inconsistencies,
making the .part variant seem hotter than it is...

>
> _Z19recompute_dominator13cdi_directionP15basic_block_def.part.0 total:1182 head:13
> 0: 13
> 3: 13
> 11: 13
>
> _Z19recompute_dominator13cdi_directionP15basic_block_def total:11 head:9
> 1: 0
> 3: 0
> 9: 9
>
> Here also, we should keep the head as it is as head is for offset 9.
>
> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node.part.0 total:85 head:5
> 0: 8
> 11: 0
> 12: 0
> 16: 0
> 17: 0
> 18: 0
> 20: 0
> 21: 0
> 25: 0
> 25.1: 2
> 27: 2
> 30: 0
> 31: 0
> 34: 0
> 35: 2
> 38: 2
> 38.1: 2
> 39: 2
> 41: 2
> 46: 2
> 52.1: 0
> 54: 0
> 54.1: 0
> 56: 8
> 57: 0
> 59: 0
> 62: 0
> 63: 3
> 65: 0
> 71.1: 0
> 77: 0
> 78: 0
> 81: 3
> 84: 2
> 86: 0
> 89: 0
> 91: 0
> 92: 0
> 92.1: 0
> 98: 0
> 99: 0
> 103: 0
> 108: 0
> 108.1: 0
> 111: 0
> 114: 0
> 120: 1
> 124: 0
> 125: 0
> 127: 0
> 128: 0
> 130: 0
> 131: 0
> 134: 0
> 139: 0
> 140: 0
> 143: 1
> 6: lookup_attribute total:40
> 4: 5
>
>
> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node total:212 head:71
> 2: 71
> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node.part.0:5
> 143: 141
>
> This looks odd.
> Looks like create_gcov is getting mixed up with the offset of
> inlined functions

I am not sure I follow what you mean here?

This is my current benchmark run with -Ofast -mtune=native (without LTO),
comparing no feedback (base) to autofdo (peak):

                    ------- base -------     ------- peak -------
Benchmark           Copies   Time   Rate     Copies   Time   Rate
500.perlbench_r          1    167   9.51 *        1    155   10.3 *
502.gcc_r                1    132   10.7 *        1    126   11.2 *
505.mcf_r                1    226   7.16 *        1    225   7.20 *
520.omnetpp_r            1    203   6.47 *        1    203   6.47 *
523.xalancbmk_r                       NR                       NR
525.x264_r               1   84.7   20.7 *        1   90.7   19.3 *
531.deepsjeng_r          1    208   5.50 *        1    209   5.47 *
541.leela_r              1    295   5.61 *        1    318   5.21 *
548.exchange2_r          1   85.9   30.5 *        1   93.3   28.1 *
557.xz_r                 1    225   4.79 *        1    220   4.90 *
Est. SPECrate2017_int_base          9.13
Est. SPECrate2017_int_peak                                   9.05

So there are regressions in x264, deepsjeng, leela and exchange2, none of
them very bad.  I think it would be interesting to understand 541.leela_r
first.

Honza
>
> Thanks,
> Kugan
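
As a cross-check of the table above, here is a small sketch that recomputes
the two summary numbers and the per-benchmark change. It assumes the overall
SPECrate figure is the geometric mean of the per-benchmark rates and that
523.xalancbmk_r (NR) is simply left out. The variable names are illustrative
only.

```python
import math

# (base rate, peak rate) copied from the table; 523.xalancbmk_r is NR and omitted.
rates = {
    "500.perlbench_r": (9.51, 10.3),
    "502.gcc_r": (10.7, 11.2),
    "505.mcf_r": (7.16, 7.20),
    "520.omnetpp_r": (6.47, 6.47),
    "525.x264_r": (20.7, 19.3),
    "531.deepsjeng_r": (5.50, 5.47),
    "541.leela_r": (5.61, 5.21),
    "548.exchange2_r": (30.5, 28.1),
    "557.xz_r": (4.79, 4.90),
}


def geomean(values):
    """Geometric mean of an iterable of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


base = geomean(b for b, _ in rates.values())
peak = geomean(p for _, p in rates.values())

# Relative change of peak (autofdo) over base (no feedback), in percent.
change = {name: (p / b - 1.0) * 100.0 for name, (b, p) in rates.items()}
regressions = sorted(name for name, c in change.items() if c < 0)

print("base %.2f  peak %.2f" % (base, peak))
for name, c in sorted(change.items(), key=lambda kv: kv[1]):
    print("%-18s %+6.1f%%" % (name, c))
```

Rounded to two decimals this reproduces the 9.13 and 9.05 estimates, and the
benchmarks with a negative change are the four regressions named above.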