Re: RFC: seeking insight on store_data_bypass_p (recog.c)

2017-04-24 Thread Jeff Law

On 04/14/2017 09:58 AM, Richard Sandiford wrote:


.md files do have the option of using a single rtl instruction to
represent a sequence of several machine instructions but:

(a) they're then effectively asking the target-independent code to
 treat the sequence "as-if" it was a single indivisible instruction.

(b) that hampers scheduling in lots of ways, so should be avoided
 unless there's really no alternative.  One problem is that it
 stops other machine instructions from being scheduled in the
 sequence.  Another is that it makes it harder to describe the
 microarchitecture effects of the sequence, since more than one
 instruction is going through the pipeline.

So yeah, if a target does put several machine instructions into
a single rtl instruction, and one of those instructions is a store,
using store_data_bypass_p on it is going to give poor results.
But it would give poor results in general, even without the bypass.
I think it's a case of "don't do that".

Sometimes it is (or was) useful to treat multiple machine
instructions as single rtl instructions during early rtl
optimisation.  It's still better to split them into individual
machine instructions for scheduling though, via define_split or
define_insn_and_split.

Agreed 100% with everything Richard says here.

I wouldn't lose any sleep if the store bypass code missed cases where 
multiple instructions are implemented in a single insn.


Jeff


Re: RFC: seeking insight on store_data_bypass_p (recog.c)

2017-04-14 Thread Richard Sandiford
Pat Haugen  writes:
> On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
>> 
>> 1. As input arguments, out_insn represents an rtl expression that
>> potentially "produces" a store to memory and in_insn represents an rtl
>> expression that potentially "consumes" a value recently stored to memory.
>> 
> You have this reversed: the code is trying to find situations where
> out_insn is producing a value that in_insn will be storing to memory.
>
>> 2. If the memory store produced matches the memory fetch consumed, this
>> function returns true to indicate that this sequence of two instructions
>> qualifies for a special "bypass" latency that represents the fact that
>> the fetch will obtain the value out of the write buffer.  So, whereas
>> the instruction scheduler might normally expect that this sequence of
>> two instructions would experience Load-Hit-Store penalties associated
>> with cache coherency hardware costs, since these two instructions qualify
>> for the store_data_bypass optimization, the instruction scheduler counts
>> the latency as only 1 or 2 cycles (potentially).  [This is what I
>> understand, but I may be wrong, so please correct me if so.]
>> 
> In general, yes, if the function returns true then the sequence has been
> identified and the target will take appropriate action (adjusting
> latency or whatever).
>
>
> As for the remainder below dealing with PARALLELs, I don't have any
> history on that so hopefully others can chime in. For the rs6000 port, I
> don't see the handling of multiple SET operations making much sense, but
> then again I don't know if it will actually occur either based on the
> places where store_data_bypass_p is used.
>
> -Pat
>

Reordering the message, sorry:

>> 2. A "bigger" concern is that any time any SETs are buried within a
>> PARALLEL tree, I'm not sure the answer produced by this function, as
>> currently implemented, is at all reliable:
>> 
>>  a) PARALLEL does not necessarily mean all of its subtrees happen in
>> parallel on hardware.  It just means that there is no sequencing imposed
>> by the source code, so the final order in which the multiple subtrees
>> beneath the PARALLEL node are executed is not known at this stage of compilation.
>>
>>  b) It seems to me that it doesn't really make sense to speak of whether
>> a whole bunch of producers combined with a whole bunch of consumers
>> qualify for an optimized store data bypass latency.  If we say that they
>> do qualify (as a group), which pair(s) of producer and consumer machine
>> instructions qualify?  It seems we need to know which producer matches
>> with which consumer in order to know where the bypass latencies "fit"
>> into the schedule.
>> 
>>  c) Furthermore, if it turns out that the "arbitrary" order in which the
>> producer instructions and consumer instructions are emitted places too
>> much "distance" between a producer and the matching consumer, then it is
>> possible that by the time the hardware executes the consumer, the stored
>> value is no longer in the write buffer, so even though we might have
>> "thought" two PARALLEL rtl expressions qualified for the store bypass
>> optimization, we really should have returned false.

The bypass function only operates on individual rtl instructions:
i.e. one producer and one consumer.  Usually rtl instructions
map to single machine instructions, with PARALLELs being used if
the machine instruction does more than one thing (more below).

.md files do have the option of using a single rtl instruction to
represent a sequence of several machine instructions but:

(a) they're then effectively asking the target-independent code to
treat the sequence "as-if" it was a single indivisible instruction.

(b) that hampers scheduling in lots of ways, so should be avoided
unless there's really no alternative.  One problem is that it
stops other machine instructions from being scheduled in the
sequence.  Another is that it makes it harder to describe the
microarchitecture effects of the sequence, since more than one
instruction is going through the pipeline.

So yeah, if a target does put several machine instructions into
a single rtl instruction, and one of those instructions is a store,
using store_data_bypass_p on it is going to give poor results.
But it would give poor results in general, even without the bypass.
I think it's a case of "don't do that".

Sometimes it is (or was) useful to treat multiple machine
instructions as single rtl instructions during early rtl
optimisation.  It's still better to split them into individual
machine instructions for scheduling though, via define_split or
define_insn_and_split.

>> 3. Actually, what I described above is only the "simple" case.  It may
>> be that the rtl for either out_insn or in_insn is really a parallel
>> clause with multiple rtl trees beneath it.  In this case, we compare the
>> subtrees in a "similar" way to see if the compound expressions qualify
>> for the store_data_bypass_p "optimization".

Re: RFC: seeking insight on store_data_bypass_p (recog.c)

2017-04-13 Thread Pat Haugen
On 04/12/2017 06:33 PM, Kelvin Nilsen wrote:
> 
> 1. As input arguments, out_insn represents an rtl expression that
> potentially "produces" a store to memory and in_insn represents an rtl
> expression that potentially "consumes" a value recently stored to memory.
> 
You have this reversed: the code is trying to find situations where
out_insn is producing a value that in_insn will be storing to memory.

> 2. If the memory store produced matches the memory fetch consumed, this
> function returns true to indicate that this sequence of two instructions
> qualifies for a special "bypass" latency that represents the fact that
> the fetch will obtain the value out of the write buffer.  So, whereas
> the instruction scheduler might normally expect that this sequence of
> two instructions would experience Load-Hit-Store penalties associated
> with cache coherency hardware costs, since these two instructions qualify
> for the store_data_bypass optimization, the instruction scheduler counts
> the latency as only 1 or 2 cycles (potentially).  [This is what I
> understand, but I may be wrong, so please correct me if so.]
> 
In general, yes, if the function returns true then the sequence has been
identified and the target will take appropriate action (adjusting
latency or whatever).


As for the remainder below dealing with PARALLELs, I don't have any
history on that so hopefully others can chime in. For the rs6000 port, I
don't see the handling of multiple SET operations making much sense, but
then again I don't know if it will actually occur either based on the
places where store_data_bypass_p is used.

-Pat

> 3. Actually, what I described above is only the "simple" case.  It may
> be that the rtl for either out_insn or in_insn is really a parallel
> clause with multiple rtl trees beneath it.  In this case, we compare the
> subtrees in a "similar" way to see if the compound expressions qualify
> for the store_data_bypass_p "optimization".  (I've got some questions
> about how this is done below.)  As currently implemented, special
> handling is given to a CLOBBER subtree as part of either PARALLEL
> expression: we ignore it.  This is because CLOBBER does not represent
> any real machine instructions.  It just represents semantic information
> that might be used by the compiler.
> 
> In addition to seeking confirmation of my existing understanding of the
> code as outlined above, the specific questions that I am seeking help
> with are:
> 
> 1. In the current implementation (as I understand it), near the top of
> the function body, we handle the case that the consumer (in_insn) rtl is
> a single SET expression and the producer (out_insn) rtl is a PARALLEL
> expression containing multiple sets.  The way I read this code, we are
> requiring that every one of the producer's parallel SET instructions
> produce the same value that is to be consumed in order to qualify this
> sequence as a "store data bypass".  That seems wrong to me.  I would
> expect that we only need "one" of the produced values to match the
> consumed value in order to qualify for the "store data bypass"
> optimization.  Please explain.  (The same confusing behavior happens
> below in the same function, in the case that the consumer rtl is a
> PARALLEL expression of multiple SETs: we require that every producer's
> stored value match every consumer's fetched value.)
> 
> 2. A "bigger" concern is that any time any SETs are buried within a
> PARALLEL tree, I'm not sure the answer produced by this function, as
> currently implemented, is at all reliable:
> 
>  a) PARALLEL does not necessarily mean all of its subtrees happen in
> parallel on hardware.  It just means that there is no sequencing imposed
> by the source code, so the final order in which the multiple subtrees
> beneath the PARALLEL node are executed is not known at this stage of compilation.
> 
>  b) It seems to me that it doesn't really make sense to speak of whether
> a whole bunch of producers combined with a whole bunch of consumers
> qualify for an optimized store data bypass latency.  If we say that they
> do qualify (as a group), which pair(s) of producer and consumer machine
> instructions qualify?  It seems we need to know which producer matches
> with which consumer in order to know where the bypass latencies "fit"
> into the schedule.
> 
>  c) Furthermore, if it turns out that the "arbitrary" order in which the
> producer instructions and consumer instructions are emitted places too
> much "distance" between a producer and the matching consumer, then it is
> possible that by the time the hardware executes the consumer, the stored
> value is no longer in the write buffer, so even though we might have
> "thought" two PARALLEL rtl expressions qualified for the store bypass
> optimization, we really should have returned false.
> 
> Can someone help me understand this better?
> 
> Thanks much.
> 
> 



RFC: seeking insight on store_data_bypass_p (recog.c)

2017-04-12 Thread Kelvin Nilsen


My work on PR80101 is "motivating" me to modify the implementation of
store_data_bypass_p (in gcc/recog.c).

I have a patch that bootstraps with no regressions.  However, I think
"regression" testing may not be enough to prove I got this right.  If my
new patch returns the wrong value, the outcome will be poor instruction
scheduling decisions, which will impact performance, but probably not
"correctness".

So I'd like some help understanding the existing implementation of
store_data_bypass_p.  To establish some context, here is what I think I
understand about this function:

1. As input arguments, out_insn represents an rtl expression that
potentially "produces" a store to memory and in_insn represents an rtl
expression that potentially "consumes" a value recently stored to memory.

2. If the memory store produced matches the memory fetch consumed, this
function returns true to indicate that this sequence of two instructions
qualifies for a special "bypass" latency that represents the fact that
the fetch will obtain the value out of the write buffer.  So, whereas
the instruction scheduler might normally expect that this sequence of
two instructions would experience Load-Hit-Store penalties associated
with cache coherency hardware costs, since these two instructions qualify
for the store_data_bypass optimization, the instruction scheduler counts
the latency as only 1 or 2 cycles (potentially).  [This is what I
understand, but I may be wrong, so please correct me if so.]
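To make the latency effect concrete, here is a toy sketch.  The cycle
counts and the function are invented; in a real port this adjustment is
expressed with define_bypass in the .md pipeline description, typically
with store_data_bypass_p as the guard.

```python
# Toy sketch of how a bypass guard feeds the scheduler's latency choice.
# Both latencies below are made-up numbers for illustration.

DEFAULT_LATENCY = 6   # e.g. a load-hit-store penalty (hypothetical)
BYPASS_LATENCY = 2    # forwarding from the write buffer (hypothetical)

def dependence_cost(pair_qualifies_for_bypass):
    """Latency the scheduler charges between producer and consumer."""
    if pair_qualifies_for_bypass:
        return BYPASS_LATENCY
    return DEFAULT_LATENCY

print(dependence_cost(True), dependence_cost(False))  # 2 6
```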

3. Actually, what I described above is only the "simple" case.  It may
be that the rtl for either out_insn or in_insn is really a parallel
clause with multiple rtl trees beneath it.  In this case, we compare the
subtrees in a "similar" way to see if the compound expressions qualify
for the store_data_bypass_p "optimization".  (I've got some questions
about how this is done below.)  As currently implemented, special
handling is given to a CLOBBER subtree as part of either PARALLEL
expression: we ignore it.  This is because CLOBBER does not represent
any real machine instructions.  It just represents semantic information
that might be used by the compiler.

In addition to seeking confirmation of my existing understanding of the
code as outlined above, the specific questions that I am seeking help
with are:

1. In the current implementation (as I understand it), near the top of
the function body, we handle the case that the consumer (in_insn) rtl is
a single SET expression and the producer (out_insn) rtl is a PARALLEL
expression containing multiple sets.  The way I read this code, we are
requiring that every one of the producer's parallel SET instructions
produce the same value that is to be consumed in order to qualify this
sequence as a "store data bypass".  That seems wrong to me.  I would
expect that we only need "one" of the produced values to match the
consumed value in order to qualify for the "store data bypass"
optimization.  Please explain.  (The same confusing behavior happens
below in the same function, in the case that the consumer rtl is a
PARALLEL expression of multiple SETs: we require that every producer's
stored value match every consumer's fetched value.)
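To illustrate the distinction with a toy model (hypothetical tuple
encoding of rtl, not GCC code): the behaviour I read in the current code
corresponds to all() over the producer/consumer SET pairs, while the
behaviour I would have expected corresponds to any():

```python
# Contrast "every SET pair must match" (what the code appears to
# require) with "one matching pair suffices" (what one might expect).

def sets_of(insn):
    """SET sub-expressions of an insn; CLOBBERs are ignored."""
    if insn[0] == "parallel":
        return [e for e in insn[1] if e[0] == "set"]
    return [insn] if insn[0] == "set" else []

def pair_matches(out_set, in_set):
    """Producer's destination is exactly the value the consumer stores."""
    return in_set[1][0] == "mem" and in_set[2] == out_set[1]

def bypass_all(out_insn, in_insn):   # current behaviour, as I read it
    outs, ins = sets_of(out_insn), sets_of(in_insn)
    return bool(outs and ins) and all(
        pair_matches(o, i) for o in outs for i in ins)

def bypass_any(out_insn, in_insn):   # behaviour I would have expected
    return any(pair_matches(o, i)
               for o in sets_of(out_insn) for i in sets_of(in_insn))

# Producer sets both r3 and r0; consumer stores only r3.
producer = ("parallel", [("set", ("reg", 3), ("const_int", 1)),
                         ("set", ("reg", 0), ("const_int", 0))])
consumer = ("set", ("mem", ("reg", 6)), ("reg", 3))
print(bypass_all(producer, consumer), bypass_any(producer, consumer))
```

With this producer/consumer pair, the all() variant rejects the bypass
even though one produced value really is the one being stored.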

2. A "bigger" concern is that any time any SETs are buried within a
PARALLEL tree, I'm not sure the answer produced by this function, as
currently implemented, is at all reliable:

 a) PARALLEL does not necessarily mean all of its subtrees happen in
parallel on hardware.  It just means that there is no sequencing imposed
by the source code, so the final order in which the multiple subtrees
beneath the PARALLEL node are executed is not known at this stage of compilation.

 b) It seems to me that it doesn't really make sense to speak of whether
a whole bunch of producers combined with a whole bunch of consumers
qualify for an optimized store data bypass latency.  If we say that they
do qualify (as a group), which pair(s) of producer and consumer machine
instructions qualify?  It seems we need to know which producer matches
with which consumer in order to know where the bypass latencies "fit"
into the schedule.

 c) Furthermore, if it turns out that the "arbitrary" order in which the
producer instructions and consumer instructions are emitted places too
much "distance" between a producer and the matching consumer, then it is
possible that by the time the hardware executes the consumer, the stored
value is no longer in the write buffer, so even though we might have
"thought" two PARALLEL rtl expressions qualified for the store bypass
optimization, we really should have returned false.

Can someone help me understand this better?

Thanks much.


-- 
Kelvin Nilsen, Ph.D.  kdnil...@linux.vnet.ibm.com
home office: 801-756-4821, cell: 520-991-6727
IBM Linux Technology Center - PPC Toolchain