> On 10.06.2025, at 15:51, Tamar Christina <tamar.christ...@arm.com> wrote:
>
>
>>
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Tuesday, June 10, 2025 2:12 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; Richard Sandiford <richard.sandif...@arm.com>;
>> nd <n...@arm.com>
>> Subject: RE: [PATCH 1/3]middle-end: support vec_cbranch_any and
>> vec_cbranch_all [PR118974]
>>
>> On Mon, 9 Jun 2025, Tamar Christina wrote:
>>
>>>> -----Original Message-----
>>>> From: Richard Biener <rguent...@suse.de>
>>>> Sent: Monday, June 9, 2025 10:30 AM
>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>> Cc: gcc-patches@gcc.gnu.org; Richard Sandiford
>> <richard.sandif...@arm.com>;
>>>> nd <n...@arm.com>
>>>> Subject: Re: [PATCH 1/3]middle-end: support vec_cbranch_any and
>>>> vec_cbranch_all [PR118974]
>>>>
>>>>> On Mon, 9 Jun 2025, Tamar Christina wrote:
>>>>>
>>>>>> This patch introduces two new vector cbranch optabs vec_cbranch_any and
>>>>>> vec_cbranch_all.
>>>>>>
>>>>>> To explain why we need two new optabs, let me explain the current cbranch and
>>>>>> its limitations and what I'm trying to optimize. So sorry for the long email,
>>>>>> but I hope it explains why I think we want new optabs.
>>>>>>
>>>>>> Today cbranch can be used for both vector and scalar modes. In both
>>>>>> these
>>>>>> cases it's intended to compare boolean values, either scalar or vector.
>>>>>>
>>>>>> The optab documentation does not, however, state that it can only handle
>>>>>> comparisons against 0. So many targets have added code for the vector variant
>>>>>> that tries to deal with the case where we branch based on two non-zero
>>>>>> registers.
>>>>>>
>>>>>> However this code can't ever be reached, because the cbranch expansion only
>>>>>> deals with comparisons against 0 for vectors. This is because for vectors the
>>>>>> rest of the compiler has no way to generate a non-zero comparison. E.g. the
>>>>>> vectorizer will always generate a zero comparison, and the C/C++ front-ends
>>>>>> won't allow vectors to be used in a cbranch as it expects a boolean value.
>>>>>> ISAs like SVE work around this by requiring you to use an SVE PTEST intrinsic,
>>>>>> which results in a single scalar boolean value that represents the flag values.
>>>>>>
>>>>>> e.g. if (svptest_any (..))
>>>>>>
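>>>>>> As a minimal ACLE-style sketch of what that looks like in source (assuming
>>>>>> the usual <arm_sve.h> svcmpgt/svptest_any intrinsics; illustrative only,
>>>>>> not taken from the patch or its testcases):
>>>>>>
>>>>>>   #include <arm_sve.h>
>>>>>>
>>>>>>   /* Per-lane compare produces a predicate; svptest_any collapses it into
>>>>>>      the single scalar boolean that the branch then uses.  */
>>>>>>   int any_gt (svbool_t pg, svfloat32_t a, svfloat32_t b)
>>>>>>   {
>>>>>>     svbool_t cmp = svcmpgt (pg, a, b);
>>>>>>     return svptest_any (pg, cmp);
>>>>>>   }
>>>>>>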
>>>>>> The natural question is then why do we not, at expand time, rewrite the
>>>>>> comparison to a non-zero comparison if the target supports it.
>>>>>>
>>>>>> The reason is we can't safely do so. For an ANY comparison (e.g. a != b) this
>>>>>> is trivial, but for an ALL comparison (e.g. a == b) we would have to flip both
>>>>>> the branch and invert the value being compared, i.e. we have to make it a != b
>>>>>> comparison.
>>>>>>
>>>>>> But in emit_cmp_and_jump_insns we can't flip the branches anymore
>>> because
>>>>> they
>>>>>> have already been lowered into a fall through branch (PC) and a label,
>>>>>> ready
>>> for
>>>>>> use in an if_then_else RTL expression.
>>>>>>
>>>>>> Additionally, as mentioned before, cbranches expect the values to be masks,
>>>>>> not values. This kind of works out if you XOR the values, but for FP vectors
>>>>>> you'd need to know what equality means for the FP format; i.e. it's possible
>>>>>> for IEEE 754 values, but it's not immediately obvious whether this is true
>>>>>> for all formats.
>>>>>>
>>>>>> Now why does any of this matter? Well, there are two optimizations we want to
>>>>>> be able to do.
>>>>>>
>>>>>> 1. Adv. SIMD does not support a vector !=, as in there's no instruction for it.
>>>>>> For both integer and FP vectors we perform the comparisons as EQ and then
>>>>>> invert the resulting mask. Ideally we'd like to replace this with just a XOR
>>>>>> and the appropriate branch.
>>>>>>
>>>>>> 2. When on an SVE-enabled system we would like to use an SVE compare + branch
>>>>>> for the Adv. SIMD sequence, which could happen due to cost modelling. However
>>>>>> we can only do so if we know that the values being compared against are the
>>>>>> boolean masks. This means we can't really use combine to do this, because
>>>>>> combine would have to match the entire sequence including the vector
>>>>>> comparisons, because at RTL we've lost the information that VECTOR_BOOLEAN_P
>>>>>> would have given us. This sequence would be too long for combine to match,
>>>>>> due to it having to match the compare + branch sequence being generated as
>>>>>> well. It also becomes a bit messy to match ANY and ALL sequences.
>>>>>
>>>>> I guess I didn't really understand the above, esp. why all of this is
>>>>> not recoverable on RTL, so I'll state what I think
>>>>> cbranch is supposed to do and then ask some questions on the proposal.
>>>>>
>>>
>>> Quite simply, the chain is too long. SVE has flag-setting comparisons, so in a
>>> single instruction we can perform the final comparison (which doesn't need to
>>> be restricted to EQ or NE) and then, using the appropriate condition code,
>>> decide whether we have an ANY or ALL comparison. So the sequence to match is
>>> very long.
>>>
>>> That is to say, in gimple we have
>>>
>>> a = b > c
>>> if (a != 0)
>>> ...
>>> and SVE doesn't need the indirection. Conceptually this makes it
>>>
>>> if (any (b > c))
>>> ...
>>>
>>> For SVE the cbranch expands to a predicate test, which also sets the flag
>>> again. This duplication in flag setting is then later removed by various
>>> patterns in the backend. This is all fine, but where it gets difficult is
>>> trying to replace the Adv. SIMD compare + branch with an SVE one. To do this
>>> it would have to match the entirety of
>>>
>>> a = b > c
>>> if (a != 0)
>>>
>>> which in RTL is
>>>
>>> (insn 66 65 67 4 (set (reg:V16QI 185 [ mask_patt_9.20_138 ])
>>> (neg:V16QI (gtu:V16QI (reg:V16QI 183)
>>> (reg:V16QI 184)))) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 67 66 68 4 (set (reg:V4SI 186)
>>> (unspec:V4SI [
>>> (subreg:V4SI (reg:V16QI 185 [ mask_patt_9.20_138 ]) 0)
>>> (subreg:V4SI (reg:V16QI 185 [ mask_patt_9.20_138 ]) 0)
>>> ] UNSPEC_UMAXV)) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 68 67 69 4 (set (reg:V16QI 185 [ mask_patt_9.20_138 ])
>>> (subreg:V16QI (reg:V4SI 186) 0)) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 69 68 70 4 (set (reg:DI 187)
>>> (subreg:DI (reg:V16QI 185 [ mask_patt_9.20_138 ]) 0)) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 70 69 71 4 (set (reg:CC 66 cc)
>>> (compare:CC (reg:DI 187)
>>> (const_int 0 [0]))) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (jump_insn 71 70 72 4 (set (pc)
>>> (if_then_else (ne (reg:CC 66 cc)
>>> (const_int 0 [0]))
>>> (label_ref:DI 397)
>>> (pc))) "/app/example.cpp":29:7 19 {condjump}
>>>
>>> And any intermediate "submatch" isn't really valid. The sequence would need
>>> to be matched as a whole or not at all.
>>>
>>> The second problem is that Adv. SIMD lacks a vector != comparison.
>>>
>>> This means there's an additional "NOT" to invert the mask. This operation can
>>> be optimized in three ways:
>>>
>>> 1. It can be converted from a MAX reduction into a MIN reduction and then the
>>> NOT can be removed. The branch would have to be inverted.
>>> 2. The data vectors can be XOR'ed together, and inequality means a non-zero
>>> bit anywhere in the result (see the sketch below).
>>> 3. Use an SVE comparison, which does have != on vectors.
>>>
>>> This chain is again quite long to match, and we can't incrementally match it
>>> since it's only of value if we know we're using the results in a zero or
>>> non-zero comparison.
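>>>
>>> As a concrete sketch of option 2 (assuming the standard <arm_neon.h>
>>> veorq_u32/vmaxvq_u32 intrinsics; this is illustrative, not the codegen the
>>> patch produces):
>>>
>>>   #include <arm_neon.h>
>>>
>>>   /* "Any lane differs" without a vector !=: XOR the data vectors and then
>>>      test whether any bit survives via an unsigned max reduction.  */
>>>   static inline int any_ne_u32 (uint32x4_t a, uint32x4_t b)
>>>   {
>>>     uint32x4_t x = veorq_u32 (a, b);   /* non-zero lane <=> a[i] != b[i] */
>>>     return vmaxvq_u32 (x) != 0;
>>>   }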
>>>
>>>> cbranch is supposed to branch on the result of a comparison,
>>>> be it on vector or scalar operands; the result of the comparison is
>>>> a scalar, or rather in full generality, a (scalar) condition-code.
>>>> Given it uses a COMPARE RTX as comparison operator we can currently
>>>> only handle whole vector equality/inequality which map to all/any.
>>>
>>> Correct, and normally this is fine, but for SVE we have support for what's
>>> essentially a "vector" condition code. This is why I also named the new
>>> optabs vec_cbranch.
>>>
>>> The biggest difference is that instead of requiring the reduction to a scalar,
>>> it allows branching on the vector element comparisons *directly*, which makes
>>> it much easier to do the two fixes above.
>>>
>>>>
>>>>> To handle these two cases I propose the new optabs vec_cbranch_any and
>>>>> vec_cbranch_all that expect the operands to be values, not masks, and the
>>>>> comparison operation is the comparison being performed. The current cbranch
>>>>> optab can't be used here because we need to be able to see both comparison
>>>>> operators (for the boolean branch and the data compare branch).
>>>>>
>>>>> The initial != 0 or == 0 is folded away into the name as _any and _all
>>>>> allowing
>>>>> the target to easily do as it wishes.
>>>>>
>>>>> I have intentionally chosen to require cbranch_optab to still be the main
>>>>> one.
>>>>> i.e. you can't have vec_cbranch_any/vec_cbranch_all without cbranch, because
>>>>> these two are an expand-time-only construct. I've also chosen to make them
>>>>> like this such that there's no change needed to any other passes in the
>>>>> middle-end.
>>>>>
>>>>> With these two optabs it's trivial to implement the two optimizations I
>>>>> described
>>>>> above. A target expansion hook is also possible but optabs felt more
>>>>> natural
>>>>> for the situation.
>>>>
>>>> What's the RTL the optabs expand to for SVE/AdvSIMD? How do you
>>>> represent the any/all vector compare difference? Looking at 3/3 I see
>>>> you use UNSPECs.
>>>>
>>>
>>> You mean in the expand RTL? That's just so I can make the iterators. The
>>> expansion is:
>>>
>>> (insn 31 30 32 5 (parallel [
>>> (set (reg:VNx4BI 133 [ mask_patt_13.20_41 ])
>>> (unspec:VNx4BI [
>>> (reg:VNx4BI 134)
>>> (const_int 1 [0x1])
>>> (gtu:VNx4BI (reg:VNx4QI 112 [ vect__1.16 ])
>>> (reg:VNx4QI 128))
>>> ] UNSPEC_PRED_Z))
>>> (clobber (reg:CC_NZC 66 cc))
>>> ]) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 32 31 33 5 (set (reg:VNx4BI 136 [ vec_mask_and_42 ])
>>> (and:VNx4BI (reg:VNx4BI 133 [ mask_patt_13.20_41 ])
>>> (reg:VNx4BI 111 [ loop_mask_36 ]))) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 33 32 34 5 (set (reg:VNx16BI 137)
>>> (const_vector:VNx16BI repeat [
>>> (const_int 1 [0x1])
>>> (const_int 0 [0]) repeated x3
>>> ])) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (insn 34 33 35 5 (set (reg:CC_NZC 66 cc)
>>> (unspec:CC_NZC [
>>> (reg:VNx16BI 137)
>>> (subreg:VNx4BI (reg:VNx16BI 137) 0)
>>> (const_int 1 [0x1])
>>> (reg:VNx4BI 136 [ vec_mask_and_42 ])
>>> ] UNSPEC_PTEST)) "/app/example.cpp":29:7 -1
>>> (nil))
>>> (jump_insn 35 34 36 5 (set (pc)
>>> (if_then_else (eq (reg:CC_NZC 66 cc)
>>> (const_int 0 [0]))
>>> (label_ref 41)
>>> (pc))) "/app/example.cpp":29:7 -1
>>> (int_list:REG_BR_PROB 1014686028 (nil))
>>> -> 41)
>>>
>>> Notice how insn 31, which is the vector compare, sets CC_NZC.
>>
>> That's a "scalar" condition code?
>>
>
> Sure, it depends on how you view it. It's set by a vector instruction.
> SVE doesn't need the reduction to scalar. As in, the predicate
> result is unused and marked as dead. I think that's the core distinction
> here and one you pointed out below. This is about not needing the
> reduction for branching. Or rather, the result of the compare isn't
> needed for the branch itself.
>
>>> Insn 32 is the masking of the compare; it'll be moved into insn 31 by
>>> combine, replacing the all-true predicate.
>>
>> So the (reg:VNx4BI 133) is essentially unused, but it's required to ...
>>
>>> Insn 34 is a flag reshaping operation. In case the predicates don't
>>> have the same element width, the CC reg value needs to be
>>> reshaped (in case you're doing e.g. a b.first (branch if the first element
>>> matches) etc.).
>>
>> ... do this, which produces the actual CC_NZC we jump on.
>
> No, it's only needed because the codegen above is generated from current
> trunk, which has a different instruction doing the data comparison.
>
> 133 is not needed at all, and is only there because insn 34 is what is mapped
> to a cbranch. So this is the code we have today, because there's no way for us
> to not emit the ptest initially ☹
>
> And our current ptest also emits CC_NZC instead of CC_Z; while ptest does set
> these, for cbranch we don't need them. So I'm changing this to CC_Z already.
>
>>
>>> However, because insn 35 only cares about 0 or !0, the shape doesn't matter.
>>> So insn 34 will be eliminated by combine.
>>>
>>> This leaves insns 31 and 35: essentially only the compare + branch,
>>> directly on the vector elements, which is what vec_cbranch is now
>>> representing.
>>>
>>> The optab allows us to generate the paradoxical subregs to convert
>>> Adv. SIMD registers into SVE ones, and the predicate mask to make
>>> the operation safe.
>>>
>>> Without this, we have to rely on combine, which (even if it could match
>>> such long sequences) would have to not just replace the Adv. SIMD compare +
>>> branch with an SVE one, but also do all the predicate optimization in the one
>>> single pattern.
>>>
>>>> What kind of compares do you expect us to generate? Any besides
>>>> equality compares?
>>>
>>> Any compare is possible. SVE vector comparisons take all 6
>>> comparison operators, and there are also floating-point versions and
>>> unordered variants. Basically, in the ISA vectors are first-class
>>> control flow elements.
>>>
>>>> Why would we not allow cbranch to take an
>>>> operand 0 that specifies the desired reduction?
>>>
>>> That's certainly possible, but that's just going to make cbranch itself
>>> more complicated to understand. I chose to go with _any and _all
>>> since, as you say, for the branch comparison itself only != 0 and == 0
>>> make sense. So cbranch would need to get two operators in this case:
>>> one for the branch comparison and one for the data comparison.
>>>
>>> i.e. if expanding
>>>
>>> a = b > c
>>> if (a != 0)
>>>
>>> branch would get
>>>
>>>   (set (reg a) (gt (reg b) (reg c)))
>>>   (clobber cc_nz)
>>>   (ne (reg a) 0)
>>>   (if_then_else ...)
>>>
>>> I basically opted for a simpler change: since you can only have != 0 and == 0,
>>> this is just encoded in the name as "any" and "all", so we don't have to send
>>> the dummy operators along.
>>
>> Well, if we'd introduce ANY_LT_EXPR, ANY_NE_EXPR, ... ALL_LT_EXPR,
>> ALL_NE_EXPR ... then we can have on GIMPLE
>>
>> if (a ANY_GT c)
>>
>> and no need for a separate compare and compare-and-branch.
>
> Yes, new comparison codes would work here too.
> But then I'll need 12 of them... but I could do so if you prefer 😊
>
> But you're right, conceptually my new optabs are introducing an
> expand time only representation of these 12 operators.
>
>>
>>>> We could
>>>> have (any:CC (eq:V4BI (reg:V4SI) (reg:V4SI))) or so, or alternatively
>>>> (more intrusive) have (any_eq:CC (reg:V4SI) (reg:V4SI))? That
>>>> would also allow you to have a proper RTL representation rather
>>>> than more and more UNSPECs.
>>>
>>> We do have proper RTL representation though..
>>
>> You have
>>
>>> (insn 31 30 32 5 (parallel [
>>> (set (reg:VNx4BI 133 [ mask_patt_13.20_41 ])
>>> (unspec:VNx4BI [
>>> (reg:VNx4BI 134)
>>> (const_int 1 [0x1])
>>> (gtu:VNx4BI (reg:VNx4QI 112 [ vect__1.16 ])
>>> (reg:VNx4QI 128))
>>> ] UNSPEC_PRED_Z))
>>> (clobber (reg:CC_NZC 66 cc))
>>
>> that's not
>>
>> (set (reg:CC_NZC 66 cc) (any_gtu:CC_NZC (reg:VNx4QI 112 [ vect__1.16 ])
>> (reg:VNx4QI 128)))
>>
>> so nothing combine or simplify-rtx can work with (because of the UNSPEC).
>>
>
> Sure, I don't think any_gtu provides any value here though. This information,
> "any" or "all", is a property of the use of the result, not of the comparison
> in RTL.
But that was the gist of the question: so the compare does _not_ set a CC that
you can directly branch on, but there's a use insn of <whatever> that either
all- or any-reduces <whatever> to a CC? I'm now confused…
> So the "none" is here
>
> (jump_insn 35 34 36 5 (set (pc)
> (if_then_else (eq (reg:CC_NZC 66 cc)
> (const_int 0 [0]))
> (label_ref 41)
> (pc))) "/app/example.cpp":29:7 -1
> (int_list:REG_BR_PROB 1014686028 (nil))
>
> So I don't think that the ANY or ALL was ever a property of the compare,
> but rather of what you do with it.
> Which is why today in gimple we have the != 0 or == 0 in the if, no?
>
>> That said, I wanted to know whether SVE/NEON can do {any,all}_<compare>
>> ops on vectors, setting the CC flag register you can then do a
>> conditional jump on. And that seems to be the case and what you
>> are after.
>>
>
> Yes, SVE can indeed, for integer comparisons. Floating point needs the
> ptest.
>
> Other operations can do this too, for instance the MATCH instruction
> being worked on now, etc.
>
>> But that's exactly what the cbranch optab is supposed to facilitate.
>> What we're lacking is communication of the all vs. any. Your choice
>> is to do that by using different optabs, 'vec_cbranch_any' vs.
>> 'vec_cbranch_all'. I see how that's "easy", you just have to adjust
>> RTL expansion.
>>
>
> Indeed, and the benefit of this is that we can expand directly to the
> optimal sequence, rather than relying on later passes to eliminate
> the unneeded predicate operations.
>
>> On GIMPLE you could argue we have everything, we can support
>>
>> if (_1 ==/!= {-1, -1, ... })
>>
>> to match vec_cbranch_all/any for an inverted compare?
>>
>
> Yes, so my patch at expand time just checks whether _1 is a compare
> and, if so, reads the arguments of _1 and, based on ==/!=, it
> chooses vec_cbranch_all/any and passes the arguments of the
> compare to the optab. It can't handle {-1,-1,-1,...} today because
> the expansion happens quite low down in expand, so we can no
> longer flip the branches to make them a normal != 0 compare.
>
> I hadn't worried about this since it would just fall back to cbranch
> and we no longer generate this representation.
>
>> I see one of your issues is that the AND does not set flags, right?
>> You do need the ptest once you have that?
>>
>
> The AND is just because the middle-end can only generate an unpredicated
> compare, and so we rely on combine to push the predicate inside eventually.
>
> So we don't need the AND itself to set flags, because for early exits all we
> care about is CC_Z, and the AND can have an effect on N and C but not Z.
>
> So for pure SVE I can eliminate it (see patch 2/3), but in the case where we
> have to generate an SVE compare for Adv. SIMD (as it does the flag setting)
> we have to match everything from the compare to the if_then_else. You need
> the compare since you have to replace it with the SVE variant, and you need
> the if_then_else to realize whether you have any or all.
>
> The optab allows us to avoid this by just emitting the SVE compare directly,
> and completely avoiding the AND by generating a predicated compare.
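>
> As a rough ACLE-level illustration of that difference (assumed <arm_sve.h>
> spelling; this is conceptual, not what the compiler literally emits):
>
>   #include <arm_sve.h>
>
>   /* What the middle-end conceptually gives us today: compare under an
>      all-true predicate, then AND with the loop mask.  */
>   svbool_t cmp_then_and (svbool_t loop_mask, svint32_t a, svint32_t b)
>   {
>     svbool_t cmp = svcmpeq (svptrue_b32 (), a, b);
>     return svand_z (loop_mask, cmp, loop_mask);
>   }
>
>   /* What the optab lets us expand to directly: a compare predicated on the
>      loop mask, with no separate AND.  */
>   svbool_t cmp_predicated (svbool_t loop_mask, svint32_t a, svint32_t b)
>   {
>     return svcmpeq (loop_mask, a, b);
>   }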
>
> So I am trying to fix some of the optimizations in RTL, but for some cases
> expanding correctly is the most reliable way.
>
> I would have thought that the two optabs would be preferable to 12 new
> comparison operators, but I could make those work too 😊
>
> Thanks,
> Tamar
>
>>
>> +@cindex @code{cbranch_any@var{mode}4} instruction pattern
>> +@item @samp{cbranch_any@var{mode}4}
>> +Conditional branch instruction combined with a compare instruction on
>> vectors
>> +where it is required that at least one of the elementwise comparisons of
>> the
>> +two input vectors is true.
>>
>> That sentence is a bit odd. Operand 0 is a comparison operator that
>> is applied to all vector elements. When at least one comparison evaluates
>> to true the branch is taken.
>>
>> +Operand 0 is a comparison operator. Operand 1 and operand 2 are the
>> +first and second operands of the comparison, respectively. Operand 3
>> +is the @code{code_label} to jump to.
>>
>> On Jun 9, 2025, Tamar Christina wrote to gcc-patches:
>>
>> This patch introduces two new vector cbranch optabs vec_cbranch_any and
>> vec_cbranch_all.
>>
>> To explain why we need two new optabs let me explain the current cbranch
>> and its
>> limitations and what I'm trying to optimize. So sorry for the long email,
>> but I
>> hope it explains why I think we want new optabs.
>>
>> Today cbranch can be used for both vector and scalar modes. In both these
>> cases it's intended to compare boolean values, either scalar or vector.
>>
>> The optab documentation does not however state that it can only handle
>> comparisons against 0. So many targets have added code for the vector
>> variant
>> that tries to deal with the case where we branch based on two non-zero
>> registers.
>>
>> However this code can't ever be reached because the cbranch expansion only
>> deals
>> with comparisons against 0 for vectors. This is because for vectors the
>> rest of
>> the compiler has no way to generate a non-zero comparison. e.g. the
>> vectorizer
>> will always generate a zero comparison, and the C/C++ front-ends won't
>> allow
>> vectors to be used in a cbranch as it expects a boolean value. ISAs like
>> SVE
>> work around this by requiring you to use an SVE PTEST intrinsics which
>> results
>> in a single scalar boolean value that represents the flag values.
>>
>> e.g. if (svptest_any (..))
>>
>> The natural question is why do we not at expand time then rewrite the
>> comparison
>> to a non-zero comparison if the target supports it.
>>
>> The reason is we can't safely do so. For an ANY comparison (e.g. a != b) this
>> is trivial, but for an ALL comparison (e.g. a == b) we would have to flip both
>> the branch and invert the value being compared, i.e. we have to make it a != b
>> comparison.
>>
>> But in emit_cmp_and_jump_insns we can't flip the branches anymore because
>> they
>> have already been lowered into a fall through branch (PC) and a label,
>> ready for
>> use in an if_then_else RTL expression.
>>
>> Additionally, as mentioned before, cbranches expect the values to be masks,
>> not values. This kind of works out if you XOR the values, but for FP vectors
>> you'd need to know what equality means for the FP format; i.e. it's possible
>> for IEEE 754 values, but it's not immediately obvious whether this is true
>> for all formats.
>>
>> Now why does any of this matter? Well there are two optimizations we want
>> to be
>> able to do.
>>
>> 1. Adv. SIMD does not support a vector !=, as in there's no instruction
>> for it.
>> For both Integer and FP vectors we perform the comparisons as EQ and
>> then
>> invert the resulting mask. Ideally we'd like to replace this with just
>> a XOR
>> and the appropriate branch.
>>
>> 2. When on an SVE enabled system we would like to use an SVE compare +
>> branch
>> for the Adv. SIMD sequence which could happen due to cost modelling.
>> However
>> we can only do so based on if we know that the values being compared
>> against
>> are the boolean masks. This means we can't really use combine to do
>> this
>> because combine would have to match the entire sequence including the
>> vector comparisons because at RTL we've lost the information that
>> VECTOR_BOOLEAN_P would have given us. This sequence would be too long
>> for
>> combine to match due to it having to match the compare + branch
>> sequence
>> being generated as well. It also becomes a bit messy to match ANY and
>> ALL
>> sequences.
>>
>> To handle these two cases I propose the new optabs vec_cbranch_any and
>> vec_cbranch_all that expect the operands to be values, not masks, and the
>> comparison operation is the comparison being performed. The current
>> cbranch
>> optab can't be used here because we need to be able to see both comparison
>> operators (for the boolean branch and the data compare branch).
>>
>> The initial != 0 or == 0 is folded away into the name as _any and _all
>> allowing
>> the target to easily do as it wishes.
>>
>> I have intentionally chosen to require cbranch_optab to still be the main
>> one.
>> i.e. you can't have vec_cbranch_any/vec_cbranch_all without cbranch
>> because
>> these two are an expand time only construct. I've also chosen them to be
>> this
>> such that there's no change needed to any other passes in the middle-end.
>>
>> With these two optabs it's trivial to implement the two optimizations I
>> described
>> above. A target expansion hook is also possible but optabs felt more
>> natural
>> for the situation.
>>
>> I.e. with them we can now generate
>>
>> .L2:
>> ldr q31, [x1, x2]
>> add v29.4s, v29.4s, v25.4s
>> add v28.4s, v28.4s, v26.4s
>> add v31.4s, v31.4s, v30.4s
>> str q31, [x1, x2]
>> add x1, x1, 16
>> cmp x1, 2560
>> beq .L1
>> .L6:
>> ldr q30, [x3, x1]
>> cmpeq p15.s, p7/z, z30.s, z27.s
>> b.none .L2
>>
>> and easily prove it correct.
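>>
>> For reference, roughly the shape of source loop that ends up as such a
>> sequence (a hypothetical example in the spirit of the PR, not the actual
>> testcase from the patch series):
>>
>>   void f (int *restrict a, int *restrict b, int key, int n)
>>   {
>>     for (int i = 0; i < n; i++)
>>       {
>>         b[i] += 1;          /* vectorized work per iteration */
>>         if (a[i] == key)    /* early break: vector compare + branch */
>>           break;
>>       }
>>   }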
>>
>> Bootstrapped Regtested on aarch64-none-linux-gnu,
>> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>> -m32, -m64 and no issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Tamar
>>
>> gcc/ChangeLog:
>>
>> PR target/118974
>> * optabs.cc (prepare_cmp_insn): Refactor to take optab to check
>> for
>> instead of hardcoded cbranch.
>> (emit_cmp_and_jump_insns): Try to emit a vec_cbranch if supported.
>> * optabs.def (vec_cbranch_any_optab, vec_cbranch_all_optab): New.
>> * doc/md.texi (cbranch_any@var{mode}4, cbranch_any@var{mode}4):
>> Document
>> them.
>>
>> ---
>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> index
>> f6314af46923beee0100a1410f089efd34d7566d..7dbfbbe1609a196b0834d458
>> b26f61904eaf5b24
>> 100644
>> --- a/gcc/doc/md.texi
>> +++ b/gcc/doc/md.texi
>> @@ -7622,6 +7622,24 @@ Operand 0 is a comparison operator. Operand 1 and
>> operand 2 are the
>> first and second operands of the comparison, respectively. Operand 3
>> is the @code{code_label} to jump to.
>>
>> +@cindex @code{cbranch_any@var{mode}4} instruction pattern
>> +@item @samp{cbranch_any@var{mode}4}
>> +Conditional branch instruction combined with a compare instruction on
>> vectors
>> +where it is required that at least one of the elementwise comparisons of
>> the
>> +two input vectors is true.
>> +Operand 0 is a comparison operator. Operand 1 and operand 2 are the
>> +first and second operands of the comparison, respectively. Operand 3
>> +is the @code{code_label} to jump to.
>> +
>> +@cindex @code{cbranch_all@var{mode}4} instruction pattern
>> +@item @samp{cbranch_all@var{mode}4}
>> +Conditional branch instruction combined with a compare instruction on
>> vectors
>> +where it is required that at all of the elementwise comparisons of the
>> +two input vectors are true.
>> +Operand 0 is a comparison operator. Operand 1 and operand 2 are the
>> +first and second operands of the comparison, respectively. Operand 3
>> +is the @code{code_label} to jump to.
>> +
>> @cindex @code{jump} instruction pattern
>> @item @samp{jump}
>> A jump inside a function; an unconditional branch. Operand 0 is the
>> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
>> index
>> 0a14b1eef8a5795e6fd24ade6da55841696315b8..77d5e6ee5d26ccda3d39126
>> 5bc45fd454530cc67
>> 100644
>> --- a/gcc/optabs.cc
>> +++ b/gcc/optabs.cc
>> @@ -4418,6 +4418,9 @@ can_vec_extract_var_idx_p (machine_mode vec_mode,
>> machine_mode extr_mode)
>>
>> *PMODE is the mode of the inputs (in case they are const_int).
>>
>> + *OPTAB is the optab to check for OPTAB_DIRECT support. Defaults to
>> + cbranch_optab.
>> +
>> This function performs all the setup necessary so that the caller only
>> has
>> to emit a single comparison insn. This setup can involve doing a
>> BLKmode
>> comparison or emitting a library call to perform the comparison if no
>> insn
>> @@ -4429,7 +4432,7 @@ can_vec_extract_var_idx_p (machine_mode vec_mode,
>> machine_mode extr_mode)
>> static void
>> prepare_cmp_insn (rtx x, rtx y, enum rtx_code comparison, rtx size,
>> int unsignedp, enum optab_methods methods,
>> - rtx *ptest, machine_mode *pmode)
>> + rtx *ptest, machine_mode *pmode, optab
>> optab=cbranch_optab)
>> {
>> machine_mode mode = *pmode;
>> rtx libfunc, test;
>> @@ -4547,7 +4550,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>> comparison, rtx size,
>> FOR_EACH_WIDER_MODE_FROM (cmp_mode, mode)
>> {
>> enum insn_code icode;
>> - icode = optab_handler (cbranch_optab, cmp_mode);
>> + icode = optab_handler (optab, cmp_mode);
>> if (icode != CODE_FOR_nothing
>> && insn_operand_matches (icode, 0, test))
>> {
>> @@ -4580,7 +4583,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>> comparison, rtx size,
>> if (comparison == UNORDERED && rtx_equal_p (x, y))
>> {
>> prepare_cmp_insn (x, y, UNLT, NULL_RTX, unsignedp, OPTAB_WIDEN,
>> - ptest, pmode);
>> + ptest, pmode, optab);
>> if (*ptest)
>> return;
>> }
>> @@ -4632,7 +4635,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>> comparison, rtx size,
>>
>> *pmode = ret_mode;
>> prepare_cmp_insn (x, y, comparison, NULL_RTX, unsignedp, methods,
>> - ptest, pmode);
>> + ptest, pmode, optab);
>> }
>>
>> return;
>> @@ -4816,12 +4819,68 @@ emit_cmp_and_jump_insns (rtx x, rtx y, enum
>> rtx_code comparison, rtx size,
>> the target supports tbranch. */
>> machine_mode tmode = mode;
>> direct_optab optab;
>> - if (op1 == CONST0_RTX (GET_MODE (op1))
>> - && validate_test_and_branch (val, &test, &tmode,
>> - &optab) != CODE_FOR_nothing)
>> + if (op1 == CONST0_RTX (GET_MODE (op1)))
>> {
>> - emit_cmp_and_jump_insn_1 (test, tmode, label, optab, prob, true);
>> - return;
>> + if (validate_test_and_branch (val, &test, &tmode,
>> + &optab) != CODE_FOR_nothing)
>> + {
>> + emit_cmp_and_jump_insn_1 (test, tmode, label, optab, prob,
>> true);
>> + return;
>> + }
>> +
>> + /* If we are comparing equality with 0, check if VAL is another
>> equality
>> + comparison and if the target supports it directly. */
>> + if (val && TREE_CODE (val) == SSA_NAME
>> + && VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (val))
>> + && (comparison == NE || comparison == EQ))
>> + {
>> + auto def_stmt = SSA_NAME_DEF_STMT (val);
>>
>> I think you may only look at the definition via get_def_for_expr (aka
>> TER), and the result may be NULL.
>>
>> + if ((icode = optab_handler (optab, mode2))
>> + != CODE_FOR_nothing
>>
>> can we perform this check before expanding the operands please?
>> Can we use vector_compare_rtx ()?
>>
>> Somehow I feel this belongs in a separate function like
>> validate_test_and_branch.
>>
>> Richard.
>>
>>> I think in the patch you've missed that the RTL from the expand is
>>> overwritten.
>>>
>>> Basically the new optabs are doing essentially the same thing that you
>>> are suggesting, but without needing the dummy second comparison_operator.
>>>
>>> Notice how in the patch the comparison operator is the general
>>>
>>> aarch64_comparison_operator
>>>
>>> and not the stricter aarch64_equality_operator.
>>>
>>> Thanks,
>>> Tamar
>>>>
>>>> Richard.
>>>>
>>>>> I.e. with them we can now generate
>>>>>
>>>>> .L2:
>>>>> ldr q31, [x1, x2]
>>>>> add v29.4s, v29.4s, v25.4s
>>>>> add v28.4s, v28.4s, v26.4s
>>>>> add v31.4s, v31.4s, v30.4s
>>>>> str q31, [x1, x2]
>>>>> add x1, x1, 16
>>>>> cmp x1, 2560
>>>>> beq .L1
>>>>> .L6:
>>>>> ldr q30, [x3, x1]
>>>>> cmpeq p15.s, p7/z, z30.s, z27.s
>>>>> b.none .L2
>>>>>
>>>>> and easily prove it correct.
>>>>>
>>>>> Bootstrapped Regtested on aarch64-none-linux-gnu,
>>>>> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>>>>> -m32, -m64 and no issues.
>>>>>
>>>>> Ok for master?
>>>>>
>>>>> Thanks,
>>>>> Tamar
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> PR target/118974
>>>>> * optabs.cc (prepare_cmp_insn): Refactor to take optab to check for
>>>>> instead of hardcoded cbranch.
>>>>> (emit_cmp_and_jump_insns): Try to emit a vec_cbranch if supported.
>>>>> * optabs.def (vec_cbranch_any_optab, vec_cbranch_all_optab): New.
>>>>> * doc/md.texi (cbranch_any@var{mode}4, cbranch_any@var{mode}4):
>>>> Document
>>>>> them.
>>>>>
>>>>> ---
>>>>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>>>> index
>>>>
>> f6314af46923beee0100a1410f089efd34d7566d..7dbfbbe1609a196b0834d458
>>>> b26f61904eaf5b24 100644
>>>>> --- a/gcc/doc/md.texi
>>>>> +++ b/gcc/doc/md.texi
>>>>> @@ -7622,6 +7622,24 @@ Operand 0 is a comparison operator. Operand 1
>>>> and operand 2 are the
>>>>> first and second operands of the comparison, respectively. Operand 3
>>>>> is the @code{code_label} to jump to.
>>>>>
>>>>> +@cindex @code{cbranch_any@var{mode}4} instruction pattern
>>>>> +@item @samp{cbranch_any@var{mode}4}
>>>>> +Conditional branch instruction combined with a compare instruction on
>> vectors
>>>>> +where it is required that at least one of the elementwise comparisons of
>>>>> the
>>>>> +two input vectors is true.
>>>>> +Operand 0 is a comparison operator. Operand 1 and operand 2 are the
>>>>> +first and second operands of the comparison, respectively. Operand 3
>>>>> +is the @code{code_label} to jump to.
>>>>> +
>>>>> +@cindex @code{cbranch_all@var{mode}4} instruction pattern
>>>>> +@item @samp{cbranch_all@var{mode}4}
>>>>> +Conditional branch instruction combined with a compare instruction on
>> vectors
>>>>> +where it is required that at all of the elementwise comparisons of the
>>>>> +two input vectors are true.
>>>>> +Operand 0 is a comparison operator. Operand 1 and operand 2 are the
>>>>> +first and second operands of the comparison, respectively. Operand 3
>>>>> +is the @code{code_label} to jump to.
>>>>> +
>>>>> @cindex @code{jump} instruction pattern
>>>>> @item @samp{jump}
>>>>> A jump inside a function; an unconditional branch. Operand 0 is the
>>>>> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
>>>>> index
>>>>
>> 0a14b1eef8a5795e6fd24ade6da55841696315b8..77d5e6ee5d26ccda3d39126
>>>> 5bc45fd454530cc67 100644
>>>>> --- a/gcc/optabs.cc
>>>>> +++ b/gcc/optabs.cc
>>>>> @@ -4418,6 +4418,9 @@ can_vec_extract_var_idx_p (machine_mode
>>>> vec_mode, machine_mode extr_mode)
>>>>>
>>>>> *PMODE is the mode of the inputs (in case they are const_int).
>>>>>
>>>>> + *OPTAB is the optab to check for OPTAB_DIRECT support. Defaults to
>>>>> + cbranch_optab.
>>>>> +
>>>>> This function performs all the setup necessary so that the caller only
>>>>> has
>>>>> to emit a single comparison insn. This setup can involve doing a
>>>>> BLKmode
>>>>> comparison or emitting a library call to perform the comparison if no
>>>>> insn
>>>>> @@ -4429,7 +4432,7 @@ can_vec_extract_var_idx_p (machine_mode
>>>> vec_mode, machine_mode extr_mode)
>>>>> static void
>>>>> prepare_cmp_insn (rtx x, rtx y, enum rtx_code comparison, rtx size,
>>>>> int unsignedp, enum optab_methods methods,
>>>>> - rtx *ptest, machine_mode *pmode)
>>>>> + rtx *ptest, machine_mode *pmode, optab
>>>> optab=cbranch_optab)
>>>>> {
>>>>> machine_mode mode = *pmode;
>>>>> rtx libfunc, test;
>>>>> @@ -4547,7 +4550,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>>>> comparison, rtx size,
>>>>> FOR_EACH_WIDER_MODE_FROM (cmp_mode, mode)
>>>>> {
>>>>> enum insn_code icode;
>>>>> - icode = optab_handler (cbranch_optab, cmp_mode);
>>>>> + icode = optab_handler (optab, cmp_mode);
>>>>> if (icode != CODE_FOR_nothing
>>>>> && insn_operand_matches (icode, 0, test))
>>>>> {
>>>>> @@ -4580,7 +4583,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>>>> comparison, rtx size,
>>>>> if (comparison == UNORDERED && rtx_equal_p (x, y))
>>>>> {
>>>>> prepare_cmp_insn (x, y, UNLT, NULL_RTX, unsignedp, OPTAB_WIDEN,
>>>>> - ptest, pmode);
>>>>> + ptest, pmode, optab);
>>>>> if (*ptest)
>>>>> return;
>>>>> }
>>>>> @@ -4632,7 +4635,7 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code
>>>> comparison, rtx size,
>>>>>
>>>>> *pmode = ret_mode;
>>>>> prepare_cmp_insn (x, y, comparison, NULL_RTX, unsignedp, methods,
>>>>> - ptest, pmode);
>>>>> + ptest, pmode, optab);
>>>>> }
>>>>>
>>>>> return;
>>>>> @@ -4816,12 +4819,68 @@ emit_cmp_and_jump_insns (rtx x, rtx y, enum
>>>> rtx_code comparison, rtx size,
>>>>> the target supports tbranch. */
>>>>> machine_mode tmode = mode;
>>>>> direct_optab optab;
>>>>> - if (op1 == CONST0_RTX (GET_MODE (op1))
>>>>> - && validate_test_and_branch (val, &test, &tmode,
>>>>> - &optab) != CODE_FOR_nothing)
>>>>> + if (op1 == CONST0_RTX (GET_MODE (op1)))
>>>>> {
>>>>> - emit_cmp_and_jump_insn_1 (test, tmode, label, optab, prob, true);
>>>>> - return;
>>>>> + if (validate_test_and_branch (val, &test, &tmode,
>>>>> + &optab) != CODE_FOR_nothing)
>>>>> + {
>>>>> + emit_cmp_and_jump_insn_1 (test, tmode, label, optab, prob, true);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + /* If we are comparing equality with 0, check if VAL is another
>>>>> equality
>>>>> + comparison and if the target supports it directly. */
>>>>> + if (val && TREE_CODE (val) == SSA_NAME
>>>>> + && VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (val))
>>>>> + && (comparison == NE || comparison == EQ))
>>>>> + {
>>>>> + auto def_stmt = SSA_NAME_DEF_STMT (val);
>>>>> + enum insn_code icode;
>>>>> + if (is_gimple_assign (def_stmt)
>>>>> + && TREE_CODE_CLASS (gimple_assign_rhs_code (def_stmt))
>>>>> + == tcc_comparison)
>>>>> + {
>>>>> + class expand_operand ops[2];
>>>>> + rtx_insn *tmp = NULL;
>>>>> + start_sequence ();
>>>>> + rtx op0c = expand_normal (gimple_assign_rhs1 (def_stmt));
>>>>> + rtx op1c = expand_normal (gimple_assign_rhs2 (def_stmt));
>>>>> + machine_mode mode2 = GET_MODE (op0c);
>>>>> + create_input_operand (&ops[0], op0c, mode2);
>>>>> + create_input_operand (&ops[1], op1c, mode2);
>>>>> +
>>>>> + int unsignedp2 = TYPE_UNSIGNED (TREE_TYPE (val));
>>>>> + auto inner_code = gimple_assign_rhs_code (def_stmt);
>>>>> + rtx test2 = NULL_RTX;
>>>>> +
>>>>> + enum rtx_code comparison2 = get_rtx_code (inner_code,
>>>>> unsignedp2);
>>>>> + if (unsignedp2)
>>>>> + comparison2 = unsigned_condition (comparison2);
>>>>> + if (comparison == NE)
>>>>> + optab = vec_cbranch_any_optab;
>>>>> + else
>>>>> + optab = vec_cbranch_all_optab;
>>>>> +
>>>>> + if ((icode = optab_handler (optab, mode2))
>>>>> + != CODE_FOR_nothing
>>>>> + && maybe_legitimize_operands (icode, 1, 2, ops))
>>>>> + {
>>>>> + prepare_cmp_insn (ops[0].value, ops[1].value, comparison2,
>>>>> + size, unsignedp2, OPTAB_DIRECT, &test2,
>>>>> + &mode2, optab);
>>>>> + emit_cmp_and_jump_insn_1 (test2, mode2, label,
>>>>> + optab, prob, false);
>>>>> + tmp = get_insns ();
>>>>> + }
>>>>> +
>>>>> + end_sequence ();
>>>>> + if (tmp)
>>>>> + {
>>>>> + emit_insn (tmp);
>>>>> + return;
>>>>> + }
>>>>> + }
>>>>> + }
>>>>> }
>>>>>
>>>>> emit_cmp_and_jump_insn_1 (test, mode, label, cbranch_optab, prob,
>> false);
>>>>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>>>>> index
>>>>
>> 23f792352388dd5f8de8a1999643179328214abf..a600938fb9e4480ff2d78c9b
>>>> 0be344e4856539bb 100644
>>>>> --- a/gcc/optabs.def
>>>>> +++ b/gcc/optabs.def
>>>>> @@ -424,6 +424,8 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
>>>>> OPTAB_D (umulhs_optab, "umulhs$a3")
>>>>> OPTAB_D (umulhrs_optab, "umulhrs$a3")
>>>>> OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
>>>>> +OPTAB_D (vec_cbranch_any_optab, "vec_cbranch_any$a4")
>>>>> +OPTAB_D (vec_cbranch_all_optab, "vec_cbranch_all$a4")
>>>>> OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
>>>>> OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
>>>>> OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Richard Biener <rguent...@suse.de>
>>>> SUSE Software Solutions Germany GmbH,
>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
>> Nuernberg)
>>>
>>
>> --
>> Richard Biener <rguent...@suse.de>
>> SUSE Software Solutions Germany GmbH,
>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)