Junchao Zhang <junchao.zh...@gmail.com> writes:

> I expected TACO was better since its website says "It uses novel compiler
> techniques to get performance competitive with hand-optimized kernels"

The TACO-generated code is basically the same as a plain CSR kernel, modulo cosmetic partitioning into groups of 32 rows.

  #pragma omp parallel for schedule(runtime)
  for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
    for (int32_t i1 = 0; i1 < 32; i1++) {
      int32_t i = i0 * 32 + i1;
      if (i >= A1_dimension)
        continue;

      double tjy_val = 0.0;
      for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
        int32_t j = A2_crd[jA];
        tjy_val += A_vals[jA] * x_vals[j];
      }
      y_vals[i] = tjy_val;
    }
  }

As Matt says, this is mostly a bandwidth test. You can do somewhat better for 
matrices with similar row lengths using SELL (sliced ELLPACK, which enables 
vector instructions), but that's secondary to bandwidth for the matrix values 
and cache locality for the vector.

With a graph and a row-based algorithm, I would partition it using METIS or 
similar, then apply a reverse Cuthill-McKee (RCM) ordering within each block. 
This gives better locality of access to the vector x, which has been known 
(since at least the PETSc-FUN3D work in the late 90s) to be a major factor in 
SpMV performance.

Benchmarking against naive orderings is one of the reasons the OSKI work was 
able to claim significant performance benefits while having virtually no impact 
on well-optimized applications (which choose a good ordering before forming 
their matrices).
