I'm developing an iterative optimization algorithm in Julia, along the lines of other contributions to the IterativeSolvers project <https://github.com/JuliaLang/IterativeSolvers.jl> or the Krylov subspace module <https://github.com/JuliaLang/IterativeSolvers.jl/blob/master/src/krylov.jl>, whose only computationally intensive step is computing A*b or A'*b. I would like to parallelize the method with a parallel sparse matrix-vector multiply. Is there a standard backend matrix-vector multiply recommended in Julia when targeting a shared-memory machine with a large number of processors? Similarly, is there a recommended backend for targeting a cluster? My matrices can easily reach 10 million rows by 1 million columns, with sparsity anywhere from 0.01% nonzeros up to problems that are nearly diagonal.
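For concreteness, here is a minimal sketch of the kernel I want to parallelize. The sizes and density are small illustrative stand-ins for the real problems, and the built-in sparse multiply shown here runs on a single core:

```julia
# Minimal sketch of the computational kernel (illustrative sizes, not the
# real 10M x 1M problems). `using SparseArrays` is needed on newer Julia
# versions; on older versions sparse operations live in Base.
using SparseArrays

m, n = 10_000, 1_000        # stand-in dimensions
A = sprand(m, n, 0.001)     # random sparse matrix, ~0.1% nonzeros
b = rand(n)
c = rand(m)

y = A * b    # forward multiply: the expensive step per iteration
z = A' * c   # adjoint multiply: the other expensive step
```

Everything else in the algorithm is cheap vector arithmetic, so speeding up these two products is all that matters.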
I've seen many posts <https://github.com/JuliaLang/julia/issues/2645> discussing integrating PETSc as a backend for this purpose, but it looks like that project <https://github.com/petsc/petsc/blob/master/bin/julia/PETSc.jl> has stalled: the last commits I can find are a year old. I'm also interested in other backends, e.g. Spark <http://spark.incubator.apache.org/>, SciDB <http://scidb.org/>, etc. I'm mainly interested in sparse problems, but as a side note, the built-in BLAS acceleration (setting the number of threads via `blas_set_num_threads`) works reasonably well for dense problems on a moderate number of processors. I wonder why the number of threads isn't set higher than one by default, for example to as many as `nprocs()` cores.
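To illustrate the dense workaround I mentioned, here is roughly what I do today; note the thread-count function has moved between Julia versions (it was a top-level `blas_set_num_threads` when I wrote this, and is `LinearAlgebra.BLAS.set_num_threads` in current releases):

```julia
# Dense case: let OpenBLAS use multiple threads.
# On current Julia the call is BLAS.set_num_threads; older versions
# exposed a top-level blas_set_num_threads instead.
using LinearAlgebra

BLAS.set_num_threads(Sys.CPU_THREADS)  # use all available cores

A = rand(2_000, 2_000)
b = rand(2_000)
y = A * b   # dense multiply, dispatched to multithreaded BLAS
```

This gives a decent speedup for dense products, but it does nothing for the sparse multiplies, which is why I'm asking about sparse backends.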
