Hi Robin,
 Thanks for your reply.
> > +
> > + /* Cost of vector reduction operations (unordered / tree reduction).
> > + Indexed by element type. */
> > + const int reduc_i8_cost;
> > + const int reduc_i16_cost;
> > + const int reduc_i32_cost;
> > + const int reduc_i64_cost;
> > + const int reduc_f16_cost;
> > + const int reduc_f32_cost;
> > + const int reduc_f64_cost;
> Do we need all of those? I'm not sure but given that they are supposed to be 
> implemented as tree reductions, the latency should not vary too much WRT the 
> element size?
This is modeled on `sve_vec_cost` in aarch64. The split may not be needed for 
generic or rocket, but some uarchs may implement reductions differently per 
element type, so I think it is fine to keep all of these fields so that 
downstream users can tune them for their uarchs.
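
To make the intent concrete, here is a minimal standalone sketch of how per-type fields could be queried. The struct `reduc_costs`, the helper `lookup_reduc_cost` and the `example_uarch_reduc_costs` table are hypothetical names and made-up values for illustration only, not code from the patch. Only the generic table mirrors the default of 2 used in the patch.

```cpp
#include <cassert>

/* Hypothetical stand-in for the reduction fields added to
   common_vector_cost; not the actual GCC definition.  */
struct reduc_costs
{
  int reduc_i8_cost;
  int reduc_i16_cost;
  int reduc_i32_cost;
  int reduc_i64_cost;
  int reduc_f16_cost;
  int reduc_f32_cost;
  int reduc_f64_cost;
};

/* Pick the unordered reduction cost for an element of ELT_BITS bits.
   IS_FLOAT selects the floating-point fields.  */
inline int
lookup_reduc_cost (const reduc_costs &costs, bool is_float, int elt_bits)
{
  switch (elt_bits)
    {
    case 8:
      assert (!is_float);  /* No 8-bit float field in this sketch.  */
      return costs.reduc_i8_cost;
    case 16:
      return is_float ? costs.reduc_f16_cost : costs.reduc_i16_cost;
    case 32:
      return is_float ? costs.reduc_f32_cost : costs.reduc_i32_cost;
    case 64:
      return is_float ? costs.reduc_f64_cost : costs.reduc_i64_cost;
    default:
      assert (false && "unsupported element size");
      return 0;
    }
}

/* Generic table: every type costs the same, as in the patch.  */
static const reduc_costs generic_reduc_costs = {2, 2, 2, 2, 2, 2, 2};

/* Made-up downstream table where the cost varies by element type.  */
static const reduc_costs example_uarch_reduc_costs = {2, 2, 3, 4, 3, 3, 5};
```

With a single shared field, the second table could not be expressed. With per-type fields, a uarch that does not need the distinction just repeats one value.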
> > +
> > + /* Cost of ordered (fold-left / strict) floating-point reductions.
> > + These are significantly more expensive than unordered (tree) reductions
> > + because RVV ordered reduction instructions (e.g. vfredosum) process
> > + elements sequentially. */
> > + const int reduc_f16_ordered_cost;
> > + const int reduc_f32_ordered_cost;
> > + const int reduc_f64_ordered_cost;
> Same here, I'm not entirely sure and uarchs might vary (wildly) but generally 
> these should scale linearly with the number of elements so perhaps one factor 
> is enough? Open for debate, though.
The ordered cost can differ with the element size. From what I see on the 
XuanTie C950 (its mcpu and mtune entries will be committed in a later patch), 
it does not scale linearly.
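
As a rough sketch of the difference between the two models: the code below compares a single-factor (linear) model with a per-type table. All names (`ordered_costs`, `linear_ordered_cost`, `fits_single_factor`) are hypothetical, and the 128-bit vector length is an assumption. The `{6, 4, 2}` table is taken from the generic values in the patch. None of these numbers are C950 data.

```cpp
/* Single-factor model: the cost is proportional to the number of
   elements in a VECTOR_BITS-bit vector.  */
inline int
linear_ordered_cost (int factor, int vector_bits, int elt_bits)
{
  return factor * (vector_bits / elt_bits);
}

/* Per-type model: one independent field per element size.  */
struct ordered_costs
{
  int reduc_f16_ordered_cost;
  int reduc_f32_ordered_cost;
  int reduc_f64_ordered_cost;
};

/* Return true if one integer factor reproduces all three entries of C
   for VECTOR_BITS-bit vectors.  The factor is derived from the f64
   entry and then checked against f32 and f16.  */
inline bool
fits_single_factor (const ordered_costs &c, int vector_bits)
{
  int n64 = vector_bits / 64;
  if (n64 == 0 || c.reduc_f64_ordered_cost % n64 != 0)
    return false;
  int factor = c.reduc_f64_ordered_cost / n64;
  return (linear_ordered_cost (factor, vector_bits, 32)
	  == c.reduc_f32_ordered_cost
	  && linear_ordered_cost (factor, vector_bits, 16)
	     == c.reduc_f16_ordered_cost);
}

/* Values from the generic table in the patch (f16, f32, f64).  */
static const ordered_costs patch_generic_costs = {6, 4, 2};

/* Made-up table that is exactly linear in the element count.  */
static const ordered_costs linear_example_costs = {8, 4, 2};
```

A table that follows the element count exactly can be collapsed into one factor. Anything else, which is what I see on our side, needs the separate fields.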
> > /* scalable vectorization (VLA) specific cost. */
> > @@ -289,7 +307,7 @@ struct scalable_vector_cost : common_vector_cost
> > {}
> > 
> > /* TODO: We will need more other kinds of vector cost for VLA.
> > - E.g. fold_left reduction cost, lanes load/store cost, ..., etc. */
> > + E.g. lanes load/store cost, ..., etc. */
> > };
> We have lane cost, so this comment can be removed. 
Got it, I will fix this once all the decisions have been made.
> > --- a/gcc/config/riscv/riscv.cc
> > +++ b/gcc/config/riscv/riscv.cc
> > @@ -415,6 +415,16 @@ static const common_vector_cost rvv_vls_vector_cost = {
> > 1, /* align_store_cost */
> > 2, /* unalign_load_cost */
> > 2, /* unalign_store_cost */
> > + 2, /* reduc_i8_cost */
> > + 2, /* reduc_i16_cost */
> > + 2, /* reduc_i32_cost */
> > + 2, /* reduc_i64_cost */
> > + 2, /* reduc_f16_cost */
> > + 2, /* reduc_f32_cost */
> > + 2, /* reduc_f64_cost */
> > + 6, /* reduc_f16_ordered_cost */
> > + 4, /* reduc_f32_ordered_cost */
> > + 2, /* reduc_f64_ordered_cost */
> > };
> Any reason why the scaling is not *2 but rather +2? I'd have expected twice 
> the work (and thus, latency) for 2x elements. Also, even 2-6 seem rather low 
> compared to regular reductions? Looking at the published Ascalon X numbers, 
> it's more like 5, 10, 20.
Yes, I agree. Since this is the common cost table, I chose values that would 
not have much effect. On our uarchs the cost is around 10 to 20, depending on 
the element size. Would 20 for f16, 10 for f32 and 5 for f64 work for you, or 
does anyone have another opinion?
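
For reference, a small standalone sketch of the values proposed above, checked against the x2 scaling you asked about. The struct `ordered_reduc_table` and the helper `doubles_per_halving` are hypothetical names used only for this illustration.

```cpp
/* Hypothetical stand-in for the three ordered reduction fields.  */
struct ordered_reduc_table
{
  int reduc_f16_ordered_cost;
  int reduc_f32_ordered_cost;
  int reduc_f64_ordered_cost;
};

/* Return true if the cost doubles each time the element size halves,
   i.e. each time the element count of a vector doubles.  */
inline bool
doubles_per_halving (const ordered_reduc_table &t)
{
  return (t.reduc_f16_ordered_cost == 2 * t.reduc_f32_ordered_cost
	  && t.reduc_f32_ordered_cost == 2 * t.reduc_f64_ordered_cost);
}

/* Values currently in the patch: +2 per step.  */
static const ordered_reduc_table current_table = {6, 4, 2};

/* Values proposed in this mail: x2 per step.  */
static const ordered_reduc_table proposed_table = {20, 10, 5};
```

The proposed table satisfies the check, while the current 6/4/2 values do not.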
> > diff --git 
> > a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c 
> > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c
> Distinct cost-model tests are better put into the costmodel sub directory.
> > #include "wred-2.c"
> > -/* { dg-final { scan-assembler-times {vfwredosum\.vs} 17 } } */
> > +/* The _Float16->float n=4 case is not vectorized because the ordered
> > + reduction cost makes it unprofitable for small trip counts. */
> > +/* { dg-final { scan-assembler-times {vfwredosum\.vs} 16 } } */
> This is supposed to test functionality so I'd rather keep the expectation and 
> add -fno-vect-cost-model.
I will resolve these later.
-- 
Best Regards
 Yaduo
