Hi -

See replies inline below
    
    I am comparing a linear algebra algorithm written in Fortran with my own 
    Chapel implementation of the same algorithm. About 90% of the computation 
    time of (slightly more than) half of my algorithm is an operation which 
    can be written as
    
      inline proc tpxyT(ref t : [?tD] ?R, const ref x : [] R, const ref y : [] R)
      {
        const (rows, columns) = tD.dims();
    
        [ (i, j) in tD ] t[i, j] += x[i] * y[j];
      }
    
    I will come back to the above shortly.
    
    I want to compare my code against Fortran.
    
    I want this comparison to be beyond fair! That is the bottom line on this 
    posting but do feel free to tell me the above code is less than perfect.
    
    Looking at the last line, it is highly parallelizable. But I assume that 
    when it is called from somewhere within a routine that is itself called 
    from within a serial block, and the program is also fed
    
        --dataParTasksPerLocale=1
    
    then no parallelization will happen, i.e. that construct is just a 
    whopping great hint to the optimizer.

There is some per-loop overhead from using a forall expression in this
way as compared to a `for` loop. Actually

for (i,j) in vectorizeOnly(tD) {
  t[i,j] += x[i] * y[j];
}

would equivalently hint vectorizability (but be aware that
these hints don't do anything today with the C backend).

    Is my guess correct?
    
    Would people agree that using this Chapel feature is still scrupulously 
    fair (in my comparison against Fortran)?
    
    If Chapel vectorized even remotely well, the following would be superior:
    
        [i in rows] t[i, ..] += x[i] * y;
    
    But Chapel is not so good at vectorization, so this approach was dropped. 

I suspect the issue with this construct is more to do with the 
overhead of constructing and operating on array views -
combined with nested parallelism. It'd be interesting to try
to improve the performance of this case if it's the way
you'd really like to express your program.

Today, the expression `t[i, ..]` allocates memory in the process
of creating the array view. That can cause performance problems.
I don't know if that's causing problems in your case here.

Anyway, it's also possible that overhead from array views is
preventing vectorization within the C compiler.

    Note that compared to the first approach, and with matrices of sizes from 
    1600*1600 to 4000*4000, my semi-rigorous experiments suggest that this has 
    somewhere between 10% and 20% longer run time using Chapel 1.22.0 with a 
    GCC backend on various Intel Xeon CPUs. I think this second line is a far superior 
    description of what I am trying to do and I await better vectorization
    in Chapel.  I was thinking of writing some C-code to do
    
        t[i,..] += x[i] * y
    
    but then realised that I have not a clue how to write an interface like
    
        proc tupv(n : int, m : int, t : T, ref u : [] T, const ref v : [] T)
    
    where T is real(32) or real(64) and then use that interface or even if 
    that interface would be the best one.

I don't think we currently have the ability to pass array views
(such as `t[i, ..]`) into `extern` functions. That might be something
we could extend `chpl_external_array` to do, though.

-michael
    
    Thanks - Damian
    
    Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
    Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
    Views & opinions here are mine and not those of any past or present employer
    
    
    _______________________________________________
    Chapel-developers mailing list
    [email protected]
    https://lists.sourceforge.net/lists/listinfo/chapel-developers 
    