Hi Damian -

Regarding vectorization and reductions, that's an interesting topic.

First, I'll point out that we (as language designers) intend that
forall loops (including the one in your forall expression [i in 1..n])
should generally be vectorizable.

Second, at some point in history, we tried telling the back-end
C compiler that the C for loops implementing forall statements
were vectorizable. However, the pragma the C compilers provide
actually asserts that the loop is "order independent", and loops
with reductions aren't strictly "order independent": the reduction
accumulation step must not be interleaved between vector lanes,
because there is one accumulator shared by all of the vector
lanes.

Third, since that problem was discovered, we've created a
GitHub issue to discuss, at a high level, our expectations for
how vectorization will work. That is here:

 https://github.com/chapel-lang/chapel/issues/7761

Note that there is a section in there about reductions.
I think the best thing to do for reductions is to generate
1 reduction accumulator *per vector lane* and then
combine these after each vectorized for loop. There are
three challenges with that:
 1. To the extent we'd like user code to be able to do
    the same thing as the reduction, we'd need to have
    a language interface for things like
    "get my vector lane" and "how many vector lanes are there"
    (or possibly "declare a tuple of vector width").
 2. We need to adjust the compiler to implement
    reductions in this alternative manner.
 3. We need a vectorizing component (whether in the
    C compiler or not) that allows this form of code.
    I'm currently excited about using the Region
    Vectorizer library for LLVM for this purpose.

For various reasons, though, the C compiler vectorizers are
pretty good at adding runtime checks and generating vectorized
variants of code. That doesn't always work, but it often does,
which leaves less of a "performance win" available from the
improvements I'm talking about above.

Cheers,

-michael
    
    If somebody is looking for a discussion to have around the coffee machine:
    
    Consider a computation like
    
        var s = + reduce [i in 1..n] x[i] * y[i];
    
    Is this likely to produce optimal vector code, if not now, then in the
    future when there is a native compiler?
    
    Or does the language need extra features to achieve things like
    
        dot(-product), e.g. dot(x[1..n], y[1..n])
    or
        sum
    
    separate from (and simpler than) a reduction?
    
    There are probably a heap of issues that this raises, including (say) the 
    (automatic) blocking to get the best out of a vector unit. Details. I am 
    not sure even Intel's C++ compiler solves that transparently and Intel 
    suggests that programmers need to do the blocking by themselves. Search
    
        Intel C++ Array Notation
    
    and see the discussion in several of the documents that it pulls up.
    
    Regards - Damian
    
    Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
    Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
    Views & opinions here are mine and not those of any past or present employer
    
    
------------------------------------------------------------------------------
    Check out the vibrant tech community on one of the world's most
    engaging tech sites, Slashdot.org! http://sdm.link/slashdot
    _______________________________________________
    Chapel-developers mailing list
    [email protected]
    https://lists.sourceforge.net/lists/listinfo/chapel-developers
    
