Hi Damian - Regarding vectorization and reductions, that's an interesting topic.
First, I'll point out that we (as language designers) intend that forall loops, including the one in your forall expression [i in 1..n], should generally be vectorizable.

Second, at some point in history we tried telling the back-end C compiler that the C for loops implementing forall statements were vectorizable. However, the pragma the C compilers provide actually means "order independent," and loops with reductions currently aren't strictly order independent: the reduction accumulation step shouldn't be interleaved between vector lanes, since there is a single accumulator shared by all of the vector lanes.

Third, since that problem was discovered, we've created a GitHub issue to discuss, at a high level, how we expect vectorization to work: https://github.com/chapel-lang/chapel/issues/7761. Note that there is a section in there about reductions. I think the best thing to do for reductions is to generate one reduction accumulator *per vector lane* and then combine these after each vectorized for loop. There are three challenges with that:

1. To the extent we'd like user code to be able to do the same thing as the reduction, we'd need a language interface for things like "get my vector lane" and "how many vector lanes are there" (or possibly "declare a tuple of vector width").
2. We need to adjust the compiler to implement reductions in this alternative manner.
3. We need a vectorizing component (whether in the C compiler or not) that accepts this form of code. I'm currently excited about using the LLVM Region Vectorizer library for this purpose.

That said, the C compilers' vectorizers are pretty good at adding runtime checks and vectorized variants of code. That doesn't always work, but it often does, which leaves less of a performance win available for the improvements I'm describing above.
Cheers,
-michael

> If somebody is looking for a discussion to have around the coffee machine: consider a computation like
>
>     var s = + reduce [i in 1..n] x[i] * y[i];
>
> Is this likely to produce optimal vector code, if not now, then in the future when there is a native compiler? Or does the language need extra features to achieve things like a dot product, e.g. dot(x[1..n], y[1..n]), or a sum that is separate from (and simpler than) a reduction? There are probably a heap of issues this raises, including (say) the (automatic) blocking needed to get the best out of a vector unit. Details. I am not sure even Intel's C++ compiler solves that transparently, and Intel suggests that programmers need to do the blocking themselves. Search for "Intel C++ Array Notation" and see the discussion in several of the documents it pulls up.
>
> Regards - Damian
>
> Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
> Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
> Views & opinions here are mine and not those of any past or present employer

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
