On Wed, Aug 31, 2011 at 2:17 PM, Ilya Enkovich <[email protected]> wrote:
> Hello Richard,
>
> Thanks for the review!
>
>> The fact that you have to adjust gcc.dg/tree-ssa/pr38533.c looks problematic
>> to me. Can you investigate the problem report, look at the geneated
>> code with the atom default of the param and see whether it's still
>> reasonably "fixed" (maybe you'd already done this)?
> This test was created as a regression test to the problem in
> linearize_expr_tree which moves all statements down to the first
> modified one during reassociation increasing registers pressure. Test
> has a lot of definitions which are all ORed like this:
> def r1
> def r2
> s = r1 or r2
> def r3
> s = s or r3
> def r4
> s = s or r4
> ...
> and it checks that register pressure is not increased after
> reassociation by simple scan of two sequential defs. If we use
> reassociation width higher than 1 then test will fails because we need
> to increase register pressure to get parallelism and finally get code
> like this:
> def r1
> def r2
> def r3
> t1 = r1 or r2
> s = s or r3
> def r4
> def r5
> s = s or t1
> t1 = r4 or r5
> ...
> So, I had to fix a test.
Ok. Thanks for explaining.
>> + /* Check if we may use less width and still compute sequence for
>> + the same time. It will allow us to reduce registers usage. */
>> + while (width > 1
>> + && get_required_cycles (ops_num, width - 1) == cycles_best)
>> + width--;
>>
>> I suppose get_required_cycles () is monotonic in width? Would it
>> make sense to binary search the best width then (I realize the
>> default is 1 and the only target differing has 2, but ...)? Maybe at
>> least add a comment to this effect? Or not decrement by one
>> but instead divide by two on each iteration (which is the same for 1 and 2
>> ...)?
>> It's also all a mapping that is constant - we should be able to
>> pre-compute it for all values, eventually "compressing" it to a
>> much simpler formula width = f (cpu_width, ops_num)?
> I replaced sequential search with a binary search. I did not pay much
> attention to this function because do not think it is really time
> consuming compared to other parts of reassociation phase. Am I too
> optimistic? If you think it might significantly affect compile time I
> can introduce a table of pre-computed values (or make it later as a
> separate fix).
+ /* Get the minimal time required for sequence computation. */
+ cycles_best = get_required_cycles (ops_num, width);
+
+ /* Check if we may use less width and still compute sequence for
+ the same time. It will allow us to reduce registers usage. */
+ width_min = 1;
+ while (width > width_min)
+ {
+ int width_mid = (width + width_min) / 2;
+
+ if (get_required_cycles (ops_num, width_mid) == cycles_best)
+ width = width_mid;
+ else if (width_min < width_mid)
+ width_min = width_mid;
+ else
+ break;
+ }
this seems to not allow cycles_best to drop with lower width, but
that it can't should be an implementation detail of get_required_cycles.
To make it not so, can you add a comment before the loop, like
/* get_required_cycles is monotonically increasing with lower width
so we can perform a binary search for the minimal width that still
results in the optimal cycle count. */
> I made all other fixes as you suggested.
With the above change the non-x86 specifc parts are ok. Please get
approval for them from a x86 maintainer.
Thanks,
Richard.
> Bootstrapped and checked on x86_64-linux.
>
> Thanks,
> Ilya
> ---
> gcc/
>
> 2011-08-31 Enkovich Ilya <[email protected]>
>
> PR middle-end/44382
> * target.def (reassociation_width): New hook.
>
> * doc/tm.texi.in (reassociation_width): Likewise.
>
> * doc/tm.texi (reassociation_width): Likewise.
>
> * doc/invoke.texi (tree-reassoc-width): New param documented.
>
> * hooks.h (hook_int_uint_mode_1): New default hook.
>
> * hooks.c (hook_int_uint_mode_1): Likewise.
>
> * config/i386/i386.h (ix86_tune_indices): Add
> X86_TUNE_REASSOC_INT_TO_PARALLEL and
> X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
> (TARGET_REASSOC_INT_TO_PARALLEL): New.
> (TARGET_REASSOC_FP_TO_PARALLEL): Likewise.
>
> * config/i386/i386.c (initial_ix86_tune_features): Add
> X86_TUNE_REASSOC_INT_TO_PARALLEL and
> X86_TUNE_REASSOC_FP_TO_PARALLEL.
>
> (ix86_reassociation_width) implementation of
> new hook for i386 target.
>
> * params.def (PARAM_TREE_REASSOC_WIDTH): New param added.
>
> * tree-ssa-reassoc.c (get_required_cycles): New function.
> (get_reassociation_width): Likewise.
> (swap_ops_for_binary_stmt): Likewise.
> (rewrite_expr_tree_parallel): Likewise.
>
> (rewrite_expr_tree): Refactored. Part of code moved into
> swap_ops_for_binary_stmt.
>
> (reassociate_bb): Now checks reassociation width to be used
> and call rewrite_expr_tree_parallel instead of rewrite_expr_tree
> if needed.
>
> gcc/testsuite/
>
> 2011-08-31 Enkovich Ilya <[email protected]>
>
> * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option
> --param tree-reassoc-width=1.
>
> * gcc.dg/tree-ssa/reassoc-24.c: New test.
> * gcc.dg/tree-ssa/reassoc-25.c: Likewise.
>