Richard Sandiford wrote:
If PARENT_A is a parent of allocno A, propagate_allocno_info does
the equivalent of:

   ALLOCNO_HARD_REG_COSTS(parent_a)[i] += ALLOCNO_HARD_REG_COSTS(a)[i]

which is what you'd expect from the comments.  But after that,
should each update ALLOCNO_HARD_REG_COSTS(a)[i] be reflected in
ALLOCNO_HARD_REG_COSTS(parent_a)[i]?  It isn't entirely clear
from the code.

E.g. process_regs_for_copy (which I believe is called after
propagate_allocno_info) updates ALLOCNO_HARD_REG_COSTS:

  ALLOCNO_HARD_REG_COSTS (a)[index] -= cost;

and I can't see how this update would ever propagate to parent allocnos.
That is true. I've checked and found that changes in process_regs_for_copy are not propagated to upper level. It is a part of add_copies and copies themself are propagated later but not changes in hard register costs. It should be fixed. Thank you for finding this.
Also, color_pass has a loop that updates the costs of child allocnos
after a parent has been allocated:


I should think more about this. This propagation is actually intention what is better to use in subloops. It could be fixed as 1) updating costs in superloops or 2) restoring costs in subloops and changing costs in subloops and superloops based on the allocation result in subloops and superloops.
  /* Update costs of the corresponding allocnos (not caps) in the
     subloops.  */
  for (subloop_node = loop_tree_node->subloops;
       subloop_node != NULL;
       subloop_node = subloop_node->subloop_next)
    {
...
   }

A lot of the fields updated here are described as "accumulated",
but the updates to SUBLOOP_ALLOCNO aren't reflected in A (or A's parents).

You might be wondering why this matters.  Well, ALLOCNO_HARD_REG_COSTS
is sometimes accessed after coloring is complete. move_spill_restore is one such function, and has:

      FOR_EACH_ALLOCNO (a, ai)
        {
          regno = ALLOCNO_REGNO (a);
          loop_node = ALLOCNO_LOOP_TREE_NODE (a);
          if (ALLOCNO_CAP_MEMBER (a) != NULL
              || ALLOCNO_CAP (a) != NULL
              || (hard_regno = ALLOCNO_HARD_REGNO (a)) < 0
              || loop_node->children == NULL
              /* don't do the optimization because it can create
                 copies and the reload pass can spill the allocno set
                 by copy although the allocno will not get memory
                 slot.  */
              || ira_reg_equiv_invariant_p[regno]
              || ira_reg_equiv_const[regno] != NULL_RTX
              || !bitmap_bit_p (loop_node->border_allocnos, ALLOCNO_NUM (a)))
            continue;
          mode = ALLOCNO_MODE (a);
          rclass = ALLOCNO_COVER_CLASS (a);
          index = ira_class_hard_reg_index[rclass][hard_regno];
          ira_assert (index >= 0);
--->   cost = (ALLOCNO_MEMORY_COST (a)
--->           - (ALLOCNO_HARD_REG_COSTS (a) == NULL
--->              ? ALLOCNO_COVER_CLASS_COST (a)
--->              : ALLOCNO_HARD_REG_COSTS (a)[index]));
          for (subloop_node = loop_node->subloops;
               subloop_node != NULL;
               subloop_node = subloop_node->subloop_next)
            {
              ira_assert (subloop_node->bb == NULL);
              subloop_allocno = subloop_node->regno_allocno_map[regno];
              if (subloop_allocno == NULL)
                continue;
              /* We have accumulated cost.  To get the real cost of
                 allocno usage in the loop we should subtract costs of
                 the subloop allocnos.  */
--->       cost -= (ALLOCNO_MEMORY_COST (subloop_allocno)
--->                - (ALLOCNO_HARD_REG_COSTS (subloop_allocno) == NULL
--->                   ? ALLOCNO_COVER_CLASS_COST (subloop_allocno)
--->                   : ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index]));

...
As I understand it, the code marked "--->" is trying to calculate the
cost to A _in isolation_ of spilling to memory.  Then the other (unmarked)
cost adjustments calculate the effect this has on the child allocnos.
If that's right, we're assuming here that ALLOCNO_MEMORY_COST,
ALLOCNO_COVER_CLASS_COST and ALLOCNO_HARD_REG_COST are still
accumulated.  The comment above the second "--->" block seems to
bear out this intention.

The problem is that the costs aren't fully accumulated.  After the
color_pass code quoted above, it is no longer true to say that:

ALLOCNO_HARD_REG_COSTS (a)[i] - \Sigma ALLOCNO_HARD_REG_COSTS (subloop_allocno)[i])

is the cost of using I for A itself.  Instead, each execution of:

cost = (ira_register_move_cost[mode][rclass][rclass] * (exit_freq + enter_freq));
              ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index] -= cost;

will also subtract COST from the cost of spilling A (the parent allocno)
in move_spill_restore.  Is this intentional?

No, it was not intentionally. We should fix it. As I understood this implementation (inaccuracy in cost accumulation) hurts targets with slow memory like MIPS although it still helps targets like x86/x86_64.
Although I suspect it isn't intentional, I can imagine it doesn't show
up much on targets whose memory move costs are significantly higher than
their register move costs.

Which brings us to MIPS.  The big question there is: what do we
do with the accumulator registers?  Do we put them in the same
cover class as GPRs?

Or perhaps that's jumping the gun.  Perhaps the first question is:
should we mark the accumulator registers as fixed, or at least hide them
from the register allocator?  I'm planning to do the latter for MIPS16,
but I don't think it's a good idea for normal MIPS for two reasons:

  - The DSP ASE provides 4 accumulators.  We want to apply
    normal register allocation to them.

  - Some targets have multiply-accumulate instructions that operate on
    LO and HI.  But it isn't always a win to use them.  If a target has
    both multiply-accumulate _and_ pipelined three-operand multiplication
    instructions, it is often better to use the latter for parallel
    multiply-accumulate chains.  We've traditionally treated the
    choice as a register-allocation problem, which seems to have
    worked reasonably well.

    Also, the macc instruction on some targets can copy the LO result
    to a GPR too.  The register allocator can currently take advantage
    of that, allocating a GPR when it's useful and not wasting one
    otherwise.  (From what I've seen in the past, JPEG FFT loops tend
    to be on the borderline as far as register pressure on MIPS goes,
    so this can be an important win.)

But there are only a limited number of accumulator registers (1 or 4,
depending on the target).  There's quite a high likelihood that
any given value will need to be spilled from the accumulators
at some point.  When that happens, it's better to spill to a GPR
than it is to spill to memory, since any load or store has to go
through a GPR anyway.  It therefore seems better to put GPRs and
accumulator registers in the same cover class.

It better to put GPRs and ACCs in the same class if it is better to spill ACC_REGS to GPR than to memory but it should be reflected in some way in memory and register costs.
We currently give moves between GPRs and accumulators a higher cost than
GPR loads and stores.[*]  On the targets for which this cost is accurate,
we _don't_ want to use LO and HI as spill space.  We also don't want
to move between one accumulator and another if we can help it.
And IRA generally seems happy with this.

  [*] Which isn't an accurate reflection of all targets, but that's
      another story.  We ought eventually to put this in the CPU cost
      table.

The hitch is that the cost of storing an accumulator to memory
is the cost of a GPR<->accumulator move plus the cost of a load
or store.  The cost of moving between one accumulator and another
is the cost of two GPR<->accumulator moves.  Both of these aggregate
costs are accurate, to the extent that the constituent costs are
accurate (see [*] above).  So we have a situation in which the
worst-case register<->register cost (acc<->acc) outweighs the
worst-cost register<->memory cost (acc<->mem).  And that goes
against the cover class documentation.

The documentation can be changed. It is just a recommendation or how I see it. We have to separate reg classes into non-intersected ones because Chaitin-Briggs coloring needs this for its work. The bigger cover classes, the more probability that RA puts more pseudos into hard registers. On the other hand CB coloring does not understand register and memory move costs (assign_hard_reg does understand but it can be used for other coloring algorithms as Chow's priority coloring). It is hard to find a balance in defining cover classes to put more pseudos into hard-registers and still generate cheap code. I should think about better IRA_COVER_CLASSES description.
For the most part things Just Work.  But in the code quoted
above, this cost:

cost = (ira_register_move_cost[mode][rclass][rclass] * (exit_freq + enter_freq));
              ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index] -= cost;

outweights this cost:

                  cost
                    += (ira_memory_move_cost[mode][rclass][1] * exit_freq
                        + ira_memory_move_cost[mode][rclass][0] * enter_freq);

and so move_spill_restore can actually spill an allocno in cases
where ira-emit.c wouldn't need to do anything.  In particular,
I saw this in zlib's infblock.c (part of CSiBE).  We had a pseudo
register P with several allocnos, and every allocno was allocated
the same register ($21, as it happens).  $21 wasn't used by any other
pseudo register, so no spill code was needed for P.  But move_spill_restore
nevertheless spilled some of the allocnos, introducing lots of redundant
loads and stores.  If you disable move_spill_restore, ira-emit.c does
the right thing.

So the move_spill_restore problem affects brok^Wunsual targets like MIPS
more than most, and I realise that, in the scenario above, MIPS is going
against design.  On the other hand, the code doesn't look quite as I'd
expect.

In summary:

  - Vlad, could you clarify when costs are and aren't accumulated?

The costs should be always accumulated. That was original design. I rewrote IR building so many times to speed it up and probably lost some places where it is not accumulated.
  - I'd appreciate suggestions about how to handle the MIPS accumulators.

It is hard to say without actual experiments.  I'd try to
1) define ACCs and GPRs as one cover class for subtargets where the move between GPRs and ACCs is cheaper than memory.
2) Separate the for the rest of subtargets.

Define correct register and memory costs in any case. To avoid spilling to HI and LO, define move costs for them which resulting in avoiding them in assign_hard_reg.
There's an auxillary question about whether we should use
REGNO_REG_CLASS more often in cases where a hard register has already
been chosen, since this can make a big difference on some targets.

Yes, I think we should. I have some comments marked by ??? to use hard register class instead of cover_class if we know the actual hard-register. But I remember one problem that costs might be wrong for some targets for some class to class movements. It is safer to use cover classes. Still I think we should move to use hard register classes where it is possible and fix register_move_costs if their wrong for some register class pairs.
For the record, the current CSiBE results for MIPS are:

Sorry, Richard. It is hard to understand to me. Is the first number is for IRA?
-O0 -EB -mips16 -mabi=32 -mhard-float          4499177  3998513 :   88.87%
...
-Os -EB -mips16 -mabi=32 -mhard-float          2839673  2840465 :  100.03%
...
and from what I've seen, the increases seem to be from this kind of needless
spilling.  (A lot of the individual -Os tests are better with IRA,
but the ones for which this problem triggers are much worse).

Richard, thanks for the deep analysis IRA code pitfalls. It is always interesting and useful to read your messages.

Reply via email to