Thanks for all the advice, everyone. I've just finished a parallel sparse matrix-vector multiplication library for shared-memory machines, written in pure Julia on top of Amit Murthy's SharedArrays. You can find it at https://github.com/madeleineudell/ParallelSparseMatMul.jl
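The core idea is easy to sketch. For y = A'*x with A stored in CSC format, each column of A contributes exactly one entry of y, so workers can fill disjoint chunks of a shared output vector with no reduction step. The following is an illustration of that idea only, not the package's actual code (the package also keeps A's colptr/rowval/nzval in SharedArrays, so the matrix isn't shipped to every worker the way it is here):

    addprocs(5)  # six local processes in total

    # Illustrative sketch: y = A'*x, parallelized over the columns of A.
    # Each iteration writes one entry of the SharedArray y, so the
    # iterations are independent and need no reduction.
    function shared_Atx(A::SparseMatrixCSC{Float64,Int}, x::Vector{Float64})
        y = SharedArray(Float64, A.n)
        @sync @parallel for j = 1:A.n
            s = 0.0
            for k = A.colptr[j]:(A.colptr[j+1]-1)
                s += A.nzval[k]*x[A.rowval[k]]
            end
            y[j] = s
        end
        y
    end

    A = sprand(10^4, 10^4, 1e-3); x = rand(10^4)
    maximum(abs(sdata(shared_Atx(A, x)) - A'*x))  # should be ~1e-16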
For a matrix with 1E9 nonzeros, its performance on 6 processors matches MKL, and it retains linear parallel scaling up to the largest thread count I've tested (40):

    (serial Julia)
    julia> @time S'*x;
    elapsed time: 17.63098703 seconds (8410364 bytes allocated)

    (MKL, 6 threads)
    julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
    elapsed time: 4.334168257 seconds (48 bytes allocated)

    (ParallelSparseMatMul, 6 threads)
    julia> @time y = A'*x;
    elapsed time: 4.18557078 seconds (1858156 bytes allocated)

On Tue, Feb 11, 2014 at 1:44 PM, Madeleine Udell <[email protected]> wrote:

Thanks, Jake. It seems I'm still not able to go above 6 threads, but the cause may be deep in my MKL install.

I've begun implementing a shared sparse matrix class in native Julia that seems to scale better; the code is attached to this email. How can I overload A'*b for my SharedBilinearOperator class, so that it calls At_mul_B on the stored A rather than computing A' each time I want to multiply by it? That is, can I write something like

    ('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)

?
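In 0.3-era Julia the parser lowers the expression L'*x directly to Ac_mul_B(L, x) (and L.'*x to At_mul_B(L, x)), so no new syntax is needed; extending those two functions is enough. A sketch, using a stand-in wrapper type since the attached SharedBilinearOperator code isn't reproduced here (only the field name A is taken from the email):

    import Base: Ac_mul_B, At_mul_B

    # Stand-in for the SharedBilinearOperator in the attached code.
    type SharedBilinearOperator
        A::SparseMatrixCSC{Float64,Int}
    end

    # L'*x parses as Ac_mul_B(L, x). For real element types the conjugate
    # transpose and the transpose coincide, so both forward to At_mul_B on
    # the stored A, and A' is never materialized.
    Ac_mul_B(L::SharedBilinearOperator, x::AbstractVector) = At_mul_B(L.A, x)
    At_mul_B(L::SharedBilinearOperator, x::AbstractVector) = At_mul_B(L.A, x)

    L = SharedBilinearOperator(sprand(5, 3, 0.5))
    L'*rand(5)  # length-3 vector, computed without forming A'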
On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <[email protected]> wrote:

Hey Madeleine,

First I would check that your global environment variables are set up correctly (see http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm). If you want to set the number of threads programmatically:

                   _
       _       _ _(_)_     |  A fresh approach to technical computing
      (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
       _ _   _| |_  __ _   |  Type "help()" to list help topics
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
     _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
    |__/                   |  x86_64-unknown-linux-gnu

    julia> Base.blas_vendor()
    :mkl

    julia> Base.blas_set_num_threads(12)

You can see the relevant code here:
https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
It's always worth a quick search through the code base to figure out what is going on behind the scenes.

Hope that helps!
Jake

On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:

Jake, thanks for the reference (http://software.intel.com/en-us/forums/topic/294954); I have 32 hyperthreaded cores, so something else must be limiting me to 6, on top of hyperthreading not being exploited.

Perhaps I need to call something like omp_set_num_threads()? But there doesn't seem to be a function by that name in the libmkl_rt library:

    julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
    ERROR: ccall: could not find function omp_set_num_threads in library libmkl_rt
     in anonymous at no file
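The single dynamic library libmkl_rt doesn't export omp_set_num_threads, but it does export MKL's own threading-control functions. Something like the following should work instead (a sketch; note that mkl_set_num_threads takes a plain C int by value, not a pointer):

    # set the MKL thread count directly...
    ccall((:mkl_set_num_threads, Base.libblas_name), Void, (Cint,), 32)
    # ...and query how many threads MKL will actually use:
    ccall((:mkl_get_max_threads, Base.libblas_name), Cint, ())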
On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <[email protected]> wrote:

Are those hyper-threaded "cores"? If so, check Intel MKL's documentation on hyper-threading.

-Best
Jake

On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:

It looks like only 6 threads are being used when I call MKL from Julia. If I do blas_set_num_threads(n), then in top I see Julia running at min(n,6)*100% CPU. Any idea why that would be, or how to get MKL to use more threads? I'm not sure if this is an issue in Julia or with my installation of MKL.

On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen <[email protected]> wrote:

You are welcome. Good to hear that it worked.

2014-02-10 22:35 GMT+01:00 Madeleine Udell <[email protected]>:

Fantastic, thank you. I now see Base.libblas_name == "libmkl_rt", and am able to run the following test successfully:

    transa = 'N'::Base.LinAlg.BlasChar  # multiply by A, not A'
    matdescra = "GXXF"  # G = general, X = ignored, F = one-based indexing
    m,n = 50,100
    A = sprand(m,n,.01)
    y = zeros(m)
    x = rand(n)
    alpha = 1.0
    beta = 1.0

    Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
    y_builtin = A*x

    julia> y == y_builtin
    true

On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen <[email protected]> wrote:

Hi Madeleine,

When compiling Julia with MKL you'll have to do make cleanall as well as rebuild ARPACK and SuiteSparse, i.e. make -C deps distclean-arpack and make -C deps distclean-suitesparse. It is also easier to create a Make.user where you set USE_MKL=1 and MKLROOT to the location of your MKL library files, e.g. /opt/intel/mkl. The arguments to cscmv! are explained here:

http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2

In short, transa determines whether the operation is Ax or A'x, and matdescra carries information about the structure of the matrix, e.g. whether it is triangular. Once you have succeeded in compiling Julia with MKL, the libblas variable should just be Base.libblas_name.
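Concretely, a minimal Make.user at the top of the Julia source tree would look like this (assuming MKL is installed under /opt/intel/mkl, as in the example above), followed by the rebuild:

    # Make.user
    USE_MKL = 1
    MKLROOT = /opt/intel/mkl

    # then, from the julia directory:
    make -C deps distclean-arpack distclean-suitesparse
    make cleanall
    make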
2014-02-10 20:37 GMT+01:00 Madeleine Udell <[email protected]>:

I'm having trouble using MKL in Julia. I changed Make.inc so that USE_MKL = 1, but when I make and run Julia, I find that Base.libblas_name == "libopenblas". Is this expected? I would have thought it would be, e.g., "libmkl_core".

Andreas, I found your wrappers for MKL in this PR. I've not used MKL before, so I don't understand the call signatures of those functions well enough to call MKL directly. Any chance you could explain what transa::BlasChar and matdescra::ASCIIString are in the following function, and which MKL library is expected in the libblas variable? I see many .so files in the lib/intel64 directory of my MKL installation, and I'm not sure which one to point Julia to.

    function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
                    A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
                    β::$T, y::StridedVector{$T})
        length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
        length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
        ccall(($(string(mv)), libblas), Void,
              (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
               Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
               Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
              &transa, &A.m, &A.n, &α,
              matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
              pointer(A.colptr, 2), x, &β, y)
        return y
    end

Thanks for your help!
Madeleine

On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen <[email protected]> wrote:

A*b will not call MKL when A is sparse. There has been some writing about making an MKL package that overwrites A_mul_B(Matrix,Vector) with the MKL versions, and I actually wrote wrappers for the sparse MKL subroutines this fall for the same reason.

2014-02-05 Madeleine Udell <[email protected]>:

Miles, you're right that writing sparse matrix-vector products in native Julia probably won't be the best idea, given Julia's model of parallelism. That's why I'm interested in calling an outside library like PETSc.

I see it's possible to link Julia with MKL. I haven't tried this yet, but if I do, will A*b (where A is sparse) call MKL to perform the matrix-vector product?

On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin <[email protected]> wrote:

Memory access is typically a significant bottleneck in sparse mat-vec, so unfortunately I'm skeptical that one could achieve good performance using Julia's current distributed-memory approach on a multicore machine. This really calls for something like OpenMP.

On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine Udell wrote:

I'm developing an iterative optimization algorithm in Julia, along the lines of other contributions to the Iterative Solvers project or the Krylov Subspace module, whose only computationally intensive step is computing A*b or A'*b. I would like to parallelize the method with a parallel sparse matrix-vector multiply. Is there a standard backend matrix-vector multiply that's recommended in Julia if I'm targeting a shared-memory computer with a large number of processors? Similarly, is there a recommended backend for targeting a cluster? My matrices can easily reach 10 million rows by 1 million columns, with sparsity anywhere from .01% down to problems that are nearly diagonal.

I've seen many posts talking about integrating PETSc as a backend for this purpose, but it looks like the project has stalled; the last commits I see are a year old. I'm also interested in other backends, e.g. Spark, SciDB, etc.

I'm more interested in solving sparse problems, but as a side note: the built-in BLAS acceleration obtained by changing the number of threads with blas_set_num_threads works OK for dense problems on a moderate number of processors. I wonder why the number of threads isn't set higher than one by default (for example, to as many as nprocs() cores)?

--
Madeleine Udell
PhD Candidate in Computational and Mathematical Engineering
Stanford University
www.stanford.edu/~udell
