Thanks for all the advice, everyone. I've just finished a parallel sparse matrix-vector multiplication library for shared-memory machines, written in pure Julia on top of Amit Murthy's SharedArrays. You can find it at https://github.com/madeleineudell/ParallelSparseMatMul.jl
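The core idea is easy to sketch. For y = A'*x with A stored in CSC format, each column of A contributes exactly one entry of y, so workers can fill disjoint chunks of a shared output vector with no reduction step. The following is an illustration of that idea only, not the package's actual code (the package also keeps A's colptr/rowval/nzval in SharedArrays, so the matrix isn't shipped to every worker the way it is here):

    addprocs(5)  # six local processes in total

    # Illustrative sketch: y = A'*x, parallelized over the columns of A.
    # Each iteration writes one entry of the SharedArray y, so the
    # iterations are independent and need no reduction.
    function shared_Atx(A::SparseMatrixCSC{Float64,Int}, x::Vector{Float64})
        y = SharedArray(Float64, A.n)
        @sync @parallel for j = 1:A.n
            s = 0.0
            for k = A.colptr[j]:(A.colptr[j+1]-1)
                s += A.nzval[k]*x[A.rowval[k]]
            end
            y[j] = s
        end
        y
    end

    A = sprand(10^4, 10^4, 1e-3); x = rand(10^4)
    maximum(abs(sdata(shared_Atx(A, x)) - A'*x))  # should be ~1e-16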
For a matrix with 1E9 nonzeros, its performance on 6 processors matches MKL, and it retains linear parallel scaling up to the largest thread count I've tested (40):

    (serial Julia)
    julia> @time S'*x;
    elapsed time: 17.63098703 seconds (8410364 bytes allocated)

    (MKL, 6 threads)
    julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
    elapsed time: 4.334168257 seconds (48 bytes allocated)

    (ParallelSparseMatMul, 6 threads)
    julia> @time y = A'*x;
    elapsed time: 4.18557078 seconds (1858156 bytes allocated)

On Tue, Feb 11, 2014 at 1:44 PM, Madeleine Udell <[email protected]> wrote:

Thanks, Jake. It seems I'm still not able to go above 6 threads, but the cause may be deep in my MKL install.

I've begun implementing a shared sparse matrix class in native Julia that seems to scale better; the code is attached to this email. How can I overload A'*b for my SharedBilinearOperator class, so that it calls At_mul_B on the stored A rather than computing A' each time I want to multiply by it? That is, can I write something like

    ('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)

?
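In 0.3-era Julia the parser lowers the expression L'*x directly to Ac_mul_B(L, x) (and L.'*x to At_mul_B(L, x)), so no new syntax is needed; extending those two functions is enough. A sketch, using a stand-in wrapper type since the attached SharedBilinearOperator code isn't reproduced here (only the field name A is taken from the email):

    import Base: Ac_mul_B, At_mul_B

    # Stand-in for the SharedBilinearOperator in the attached code.
    type SharedBilinearOperator
        A::SparseMatrixCSC{Float64,Int}
    end

    # L'*x parses as Ac_mul_B(L, x). For real element types the conjugate
    # transpose and the transpose coincide, so both forward to At_mul_B on
    # the stored A, and A' is never materialized.
    Ac_mul_B(L::SharedBilinearOperator, x::AbstractVector) = At_mul_B(L.A, x)
    At_mul_B(L::SharedBilinearOperator, x::AbstractVector) = At_mul_B(L.A, x)

    L = SharedBilinearOperator(sprand(5, 3, 0.5))
    L'*rand(5)  # length-3 vector, computed without forming A'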
On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <[email protected]> wrote:

Hey Madeleine,

First I would check that your global environment variables are set up correctly (see http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm). If you want to set the number of threads programmatically:

                   _
       _       _ _(_)_     |  A fresh approach to technical computing
      (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
       _ _   _| |_  __ _   |  Type "help()" to list help topics
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
     _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
    |__/                   |  x86_64-unknown-linux-gnu

    julia> Base.blas_vendor()
    :mkl

    julia> Base.blas_set_num_threads(12)

You can see the relevant code here:
https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
It's always worth a quick search through the code base to figure out what is going on behind the scenes.

Hope that helps!
Jake

On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:

Jake, thanks for the reference (http://software.intel.com/en-us/forums/topic/294954); I have 32 hyperthreaded cores, so something else must be limiting me to 6, on top of hyperthreading not being exploited.

Perhaps I need to call something like omp_set_num_threads()? But there doesn't seem to be a function by that name in the libmkl_rt library:

    julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
    ERROR: ccall: could not find function omp_set_num_threads in library libmkl_rt
     in anonymous at no file
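The single dynamic library libmkl_rt doesn't export omp_set_num_threads, but it does export MKL's own threading-control functions. Something like the following should work instead (a sketch; note that mkl_set_num_threads takes a plain C int by value, not a pointer):

    # set the MKL thread count directly...
    ccall((:mkl_set_num_threads, Base.libblas_name), Void, (Cint,), 32)
    # ...and query how many threads MKL will actually use:
    ccall((:mkl_get_max_threads, Base.libblas_name), Cint, ())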
On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <[email protected]> wrote:

Are those hyper-threaded "cores"? If so, check Intel MKL's documentation on hyper-threading.

-Best
Jake

On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:

It looks like only 6 threads are being used when I call MKL from Julia. If I do blas_set_num_threads(n), then in top I see Julia running at min(n,6)*100% CPU. Any idea why that would be, or how to get MKL to use more threads? I'm not sure if this is an issue in Julia or with my installation of MKL.

On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen <[email protected]> wrote:

You are welcome. Good to hear that it worked.

2014-02-10 22:35 GMT+01:00 Madeleine Udell <[email protected]>:

Fantastic, thank you. I now see Base.libblas_name == "libmkl_rt", and am able to run the following test successfully:

    transa = 'N'::Base.LinAlg.BlasChar  # multiply by A, not A'
    matdescra = "GXXF"  # G = general, X = ignored, F = one-based indexing
    m,n = 50,100
    A = sprand(m,n,.01)
    y = zeros(m)
    x = rand(n)
    alpha = 1.0
    beta = 1.0

    Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
    y_builtin = A*x

    julia> y == y_builtin
    true

On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen <[email protected]> wrote:

Hi Madeleine,

When compiling Julia with MKL you'll have to do make cleanall as well as rebuild ARPACK and SuiteSparse, i.e. make -C deps distclean-arpack and make -C deps distclean-suitesparse. It is also easier to create a Make.user where you set USE_MKL=1 and MKLROOT to the location of your MKL library files, e.g. /opt/intel/mkl. The arguments to cscmv! are explained here:

http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2

In short, transa determines whether the operation is Ax or A'x, and matdescra carries information about the structure of the matrix, e.g. whether it is triangular. Once you have succeeded in compiling Julia with MKL, the libblas variable should just be Base.libblas_name.
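Concretely, a minimal Make.user at the top of the Julia source tree would look like this (assuming MKL is installed under /opt/intel/mkl, as in the example above), followed by the rebuild:

    # Make.user
    USE_MKL = 1
    MKLROOT = /opt/intel/mkl

    # then, from the julia directory:
    make -C deps distclean-arpack distclean-suitesparse
    make cleanall
    make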
2014-02-10 20:37 GMT+01:00 Madeleine Udell <[email protected]>:

I'm having trouble using MKL in Julia. I changed Make.inc so that USE_MKL = 1, but when I make and run Julia, I find that Base.libblas_name == "libopenblas". Is this expected? I would have thought it would be, e.g., "libmkl_core".

Andreas, I found your wrappers for MKL in this PR. I've not used MKL before, so I don't understand the call signatures of those functions well enough to call MKL directly. Any chance you could explain what transa::BlasChar and matdescra::ASCIIString are in the following function, and which MKL library is expected in the libblas variable? I see many .so files in the lib/intel64 directory of my MKL installation, and I'm not sure which one to point Julia to.

    function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
                    A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
                    β::$T, y::StridedVector{$T})
        length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
        length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
        ccall(($(string(mv)), libblas), Void,
              (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
               Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
               Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
              &transa, &A.m, &A.n, &α,
              matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
              pointer(A.colptr, 2), x, &β, y)
        return y
    end

Thanks for your help!
Madeleine

On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen <[email protected]> wrote:

A*b will not call MKL when A is sparse. There has been some writing about making an MKL package that overwrites A_mul_B(Matrix,Vector) with the MKL versions, and I actually wrote wrappers for the sparse MKL subroutines this fall for the same reason.

2014-02-05 Madeleine Udell <[email protected]>:

Miles, you're right that writing sparse matrix-vector products in native Julia probably won't be the best idea, given Julia's model of parallelism. That's why I'm interested in calling an outside library like PETSc.

I see it's possible to link Julia with MKL. I haven't tried this yet, but if I do, will A*b (where A is sparse) call MKL to perform the matrix-vector product?

On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin <[email protected]> wrote:

Memory access is typically a significant bottleneck in sparse mat-vec, so unfortunately I'm skeptical that one could achieve good performance using Julia's current distributed-memory approach on a multicore machine. This really calls for something like OpenMP.

On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine Udell wrote:

I'm developing an iterative optimization algorithm in Julia, along the lines of other contributions to the Iterative Solvers project or the Krylov Subspace module, whose only computationally intensive step is computing A*b or A'*b. I would like to parallelize the method with a parallel sparse matrix-vector multiply. Is there a standard backend matrix-vector multiply that's recommended in Julia if I'm targeting a shared-memory computer with a large number of processors? Similarly, is there a recommended backend for targeting a cluster? My matrices can easily reach 10 million rows by 1 million columns, with sparsity anywhere from .01% down to problems that are nearly diagonal.

I've seen many posts talking about integrating PETSc as a backend for this purpose, but it looks like the project has stalled; the last commits I see are a year old. I'm also interested in other backends, e.g. Spark, SciDB, etc.

I'm more interested in solving sparse problems, but as a side note: the built-in BLAS acceleration obtained by changing the number of threads with blas_set_num_threads works OK for dense problems on a moderate number of processors. I wonder why the number of threads isn't set higher than one by default (for example, to as many as nprocs() cores)?

--
Madeleine Udell
PhD Candidate in Computational and Mathematical Engineering
Stanford University
www.stanford.edu/~udell
