Nice!

On Thu, Feb 13, 2014 at 6:49 PM, Madeleine Udell
<[email protected]> wrote:

> Thanks for all the advice, everyone. I've just finished a parallel sparse
> matrix-vector multiplication library written in pure Julia for
> shared-memory machines, using Amit Murthy's SharedArrays. You can find it at
> https://github.com/madeleineudell/ParallelSparseMatMul.jl
>
> For a matrix with 1e9 nonzeros, its performance on 6 processors matches
> MKL's, and it retains linear parallel scaling up to the highest thread
> count I've tested (40).
>
> (serial julia)
> julia> @time S'*x;
> elapsed time: 17.63098703 seconds (8410364 bytes allocated)
>
> (mkl, 6 threads)
> julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
> elapsed time: 4.334168257 seconds (48 bytes allocated)
>
> (ParallelSparseMatMul, 6 threads)
> julia> @time y = A'*x;
> elapsed time: 4.18557078 seconds (1858156 bytes allocated)
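>
> The kernel itself is tiny. Here's a stripped-down sketch of the idea (not
> the package's exact code): split the columns of S across the worker
> processes and let each worker fill in its own slice of a shared output
> vector. For y = S'*x no locking is needed, since column j of S only ever
> touches y[j]. (Both helpers would need to be defined with @everywhere.)
>
> # Sketch only: in the package the matrix itself also lives in shared
> # memory; here a plain SparseMatrixCSC is passed around for brevity.
> function shared_At_mul_B!(y::SharedArray, S::SparseMatrixCSC, x::SharedArray)
>     np = length(procs(y))
>     @sync for (i, w) in enumerate(procs(y))
>         cols = (1 + div((i-1)*S.n, np)):div(i*S.n, np)
>         @async remotecall_wait(w, col_range_mul!, y, S, x, cols)
>     end
>     return y
> end
>
> # each worker computes y[j] = dot(S[:,j], x) for its block of columns
> function col_range_mul!(y, S, x, cols)
>     for j in cols
>         s = zero(eltype(y))
>         for k in S.colptr[j]:(S.colptr[j+1]-1)
>             s += S.nzval[k] * x[S.rowval[k]]
>         end
>         y[j] = s
>     end
> end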
>
> On Tue, Feb 11, 2014 at 1:44 PM, Madeleine Udell <
> [email protected]> wrote:
>
>> Thanks, Jake. It seems like I'm still not able to go above 6 threads, but
>> that may be deep in my MKL install.
>>
>> I've begun implementing a shared sparse matrix type in native Julia that
>> seems to scale better. The code is attached to this email. How can I
>> overload A'*b for my SharedBilinearOperator type, so that I can call
>> At_mul_B on a stored copy of A' rather than computing A' each time I want
>> to multiply by it? I.e., can I write something like
>>
>> ('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)
>>
>> ?
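>>
>> My current guess, untested: since A'*x lowers to Ac_mul_B(A, x), perhaps I
>> can define that method directly, assuming the operator stores a copy of
>> the transpose in a (hypothetical) field L.AT at construction time:
>>
>> import Base: Ac_mul_B, At_mul_B
>> At_mul_B(L::SharedBilinearOperator, x) = L.AT * x
>> # for real element types A' == A.', so the same body serves both
>> Ac_mul_B(L::SharedBilinearOperator, x) = L.AT * x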
>>
>>
>> On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <[email protected]> wrote:
>>
>>> Hey Madeleine,
>>>
>>> First I would check that your global environment variables
>>> <http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm>
>>> are set up correctly. If you want to set the number of threads
>>> programmatically:
>>>
>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>    _ _   _| |_  __ _   |  Type "help()" to list help topics
>>>   | | | | | | |/ _` |  |
>>>   | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
>>>  _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
>>> |__/                   |  x86_64-unknown-linux-gnu
>>>
>>> julia> Base.blas_vendor()
>>> :mkl
>>>
>>> julia> Base.blas_set_num_threads(12)
>>>
>>>
>>> You can see the relevant code here:
>>> https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
>>> It's always worth a quick search through the code base to figure out
>>> what is going on behind the scenes.
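>>>
>>> The gist of that function (paraphrased from memory -- check the link for
>>> the real thing) is a ccall dispatched on the BLAS vendor, roughly:
>>>
>>> function blas_set_num_threads(n::Integer)
>>>     blas = blas_vendor()
>>>     if blas == :openblas
>>>         return ccall((:openblas_set_num_threads, Base.libblas_name),
>>>                      Void, (Int32,), n)
>>>     elseif blas == :mkl
>>>         # MKL's C interface takes the thread count by value
>>>         return ccall((:MKL_Set_Num_Threads, Base.libblas_name),
>>>                      Void, (Cint,), n)
>>>     end
>>> end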
>>>
>>> Hope that helps!
>>> Jake
>>>
>>> On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:
>>>>
>>>> Jake, thanks for the reference
>>>> <http://software.intel.com/en-us/forums/topic/294954>;
>>>> I have 32 hyperthreaded cores, so something else must be limiting me to
>>>> 6, in addition to not exploiting hyperthreading.
>>>>
>>>> Perhaps I need to call something like omp_set_num_threads()? But there
>>>> doesn't seem to be a function by this name in the libmkl_rt library.
>>>>
>>>> julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
>>>> ERROR: ccall: could not find function omp_set_num_threads in library
>>>> libmkl_rt
>>>>  in anonymous at no file
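>>>>
>>>> Maybe the MKL-specific entry point is the right one to try instead? Just
>>>> a guess, untested -- mkl_set_num_threads takes an int by value:
>>>>
>>>> julia> ccall((:mkl_set_num_threads, Base.libblas_name), Void, (Cint,), 32)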
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <[email protected]> wrote:
>>>>
>>>>> Are those hyper-threaded "cores"?  If so, check Intel MKL's
>>>>> documentation on hyper-threading.
>>>>>
>>>>> -Best
>>>>> Jake
>>>>>
>>>>> On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:
>>>>>
>>>>>> It looks like only 6 threads are being used when I call MKL from
>>>>>> Julia. If I do blas_set_num_threads(n), then watching top I see julia
>>>>>> running at min(n,6)*100% CPU. Any idea why this would be, or how to
>>>>>> get MKL to use more threads? I'm not sure whether this is an issue in
>>>>>> Julia or with my MKL installation.
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen
>>>>>> <[email protected]> wrote:
>>>>>> > You are welcome. Good to hear that it worked.
>>>>>> >
>>>>>> >
>>>>>> > 2014-02-10 22:35 GMT+01:00 Madeleine Udell <[email protected]>:
>>>>>> >
>>>>>> >> Fantastic, thank you. I now see Base.libblas_name = "libmkl_rt",
>>>>>> >> and am able to run the following test successfully:
>>>>>> >>
>>>>>> >> transa = 'N'::Base.LinAlg.BlasChar # multiply by A, not A'
>>>>>> >> matdescra = "GXXF" # G = general, X = ignored, F = one-based indexing
>>>>>> >> m,n = 50,100
>>>>>> >> A = sprand(m,n,.01)
>>>>>> >> y = zeros(m)
>>>>>> >> x = rand(n)
>>>>>> >> alpha = 1.0
>>>>>> >> beta = 1.0
>>>>>> >>
>>>>>> >> Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
>>>>>> >> y_builtin = A*x
>>>>>> >>
>>>>>> >> julia> y==y_builtin
>>>>>> >> true
>>>>>> >>
>>>>>> >>
>>>>>> >> On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen
>>>>>> >> <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> Hi Madeleine
>>>>>> >>>
>>>>>> >>> When compiling julia with MKL you'll have to do make cleanall as
>>>>>> >>> well as rebuild ARPACK and SuiteSparse, i.e. make -C deps
>>>>>> >>> distclean-arpack and make -C deps distclean-suitesparse. It is
>>>>>> >>> also easiest to create a Make.user where you set USE_MKL=1 and
>>>>>> >>> MKLROOT to the location of your MKL library files, e.g.
>>>>>> >>> /opt/intel/mkl. The arguments are explained here:
>>>>>> >>>
>>>>>> >>> http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2
>>>>>> >>>
>>>>>> >>> In short, transa determines whether the operation is Ax or A'x,
>>>>>> >>> and matdescra carries information about the structure of the
>>>>>> >>> matrix, e.g. whether it is triangular. Once you have succeeded in
>>>>>> >>> compiling julia with MKL, the libblas variable should just be
>>>>>> >>> Base.libblas_name.
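>>>>>> >>>
>>>>>> >>> A minimal Make.user sketch (the MKLROOT path is just an example):
>>>>>> >>>
>>>>>> >>> # Make.user, at the top of the julia source tree
>>>>>> >>> USE_MKL = 1
>>>>>> >>> MKLROOT = /opt/intel/mkl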
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> 2014-02-10 20:37 GMT+01:00 Madeleine Udell <[email protected]>:
>>>>>>
>>>>>> >>>
>>>>>> >>>> I'm having trouble using MKL in julia. I changed Make.inc so
>>>>>> >>>> that USE_MKL = 1, but when I make and run julia, I find that
>>>>>> >>>> Base.libblas_name = "libopenblas". Is this expected? I would have
>>>>>> >>>> thought it would be, e.g., "libmkl_core".
>>>>>> >>>>
>>>>>> >>>> Andreas, I found your wrappers for MKL in this PR. I've not used
>>>>>> >>>> MKL before, so I don't understand the call signature of those
>>>>>> >>>> functions well enough to call MKL directly. Any chance you could
>>>>>> >>>> explain what transa::BlasChar and matdescra::ASCIIString are in
>>>>>> >>>> the following function, and which MKL library is expected in the
>>>>>> >>>> libblas variable? I see many .so files in the lib/intel64
>>>>>> >>>> directory of my MKL installation, and I'm not sure which one to
>>>>>> >>>> point julia to.
>>>>>> >>>>
>>>>>> >>>> function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
>>>>>> >>>>                 A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
>>>>>> >>>>                 β::$T, y::StridedVector{$T})
>>>>>> >>>>     length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
>>>>>> >>>>     length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
>>>>>> >>>>     # y := alpha*op(A)*x + beta*y, via MKL's sparse mv routine
>>>>>> >>>>     ccall(($(string(mv)), libblas), Void,
>>>>>> >>>>           (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
>>>>>> >>>>            Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
>>>>>> >>>>            Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
>>>>>> >>>>           &transa, &A.m, &A.n, &α,
>>>>>> >>>>           matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
>>>>>> >>>>           pointer(A.colptr, 2), x, &β, y)
>>>>>> >>>>     return y
>>>>>> >>>> end
>>>>>> >>>>
>>>>>> >>>> Thanks for your help!
>>>>>> >>>> Madeleine
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen
>>>>>> >>>> <[email protected]> wrote:
>>>>>> >>>>>
>>>>>> >>>>> A*b will not call MKL when A is sparse. There has been some
>>>>>> >>>>> discussion about making an MKL package that overrides
>>>>>> >>>>> A_mul_B(Matrix, Vector) with the MKL versions, and I actually
>>>>>> >>>>> wrote wrappers for the sparse MKL subroutines this fall for the
>>>>>> >>>>> same reason.
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> 2014-02-05 Madeleine Udell <[email protected]>:
>>>>>> >>>>>
>>>>>> >>>>>> Miles, you're right that writing sparse matrix-vector products
>>>>>> >>>>>> in native Julia probably won't be the best idea given Julia's
>>>>>> >>>>>> model of parallelism. That's why I'm interested in calling an
>>>>>> >>>>>> outside library like PETSc.
>>>>>> >>>>>>
>>>>>> >>>>>> I see it's possible to link Julia with MKL. I haven't tried
>>>>>> >>>>>> this yet, but if I do, will A*b (where A is sparse) call MKL to
>>>>>> >>>>>> perform the matrix-vector product?
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>> On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin
>>>>>> >>>>>> <[email protected]> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Memory access is typically a significant bottleneck in sparse
>>>>>> >>>>>>> mat-vec, so unfortunately I'm skeptical that one could achieve
>>>>>> >>>>>>> good performance using Julia's current distributed-memory
>>>>>> >>>>>>> approach on a multicore machine. This really calls for
>>>>>> >>>>>>> something like OpenMP.
>>>>>> >>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine
>>>>>> >>>>>>> Udell wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I'm developing an iterative optimization algorithm in Julia,
>>>>>> >>>>>>>> along the lines of other contributions to the Iterative
>>>>>> >>>>>>>> Solvers project or the Krylov Subspace module, whose only
>>>>>> >>>>>>>> computationally intensive step is computing A*b or A'*b. I
>>>>>> >>>>>>>> would like to parallelize the method using a parallel sparse
>>>>>> >>>>>>>> matrix-vector multiply. Is there a standard backend
>>>>>> >>>>>>>> matrix-vector multiply that's recommended in Julia if I'm
>>>>>> >>>>>>>> targeting a shared-memory computer with a large number of
>>>>>> >>>>>>>> processors? Similarly, is there a recommended backend for
>>>>>> >>>>>>>> targeting a cluster? My matrices can easily reach 10 million
>>>>>> >>>>>>>> rows by 1 million columns, with density anywhere from .01%
>>>>>> >>>>>>>> down to problems that are nearly diagonal.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I've seen many posts about integrating PETSc as a backend
>>>>>> >>>>>>>> for this purpose, but it looks like that project has stalled:
>>>>>> >>>>>>>> the last commits I see are a year old. I'm also interested in
>>>>>> >>>>>>>> other backends, e.g. Spark, SciDB, etc.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I'm mostly interested in sparse problems, but as a side note,
>>>>>> >>>>>>>> the built-in BLAS acceleration from raising the number of
>>>>>> >>>>>>>> threads with blas_set_num_threads works OK for dense problems
>>>>>> >>>>>>>> on a moderate number of processors. I wonder why the number
>>>>>> >>>>>>>> of threads isn't set higher than one by default, for example
>>>>>> >>>>>>>> to as many as nprocs() cores?
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>> --
>>>>>> >>>>>> Madeleine Udell
>>>>>> >>>>>> PhD Candidate in Computational and Mathematical Engineering
>>>>>> >>>>>> Stanford University
>>>>>> >>>>>> www.stanford.edu/~udell
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> --
>>>>>> >>>>> Med venlig hilsen
>>>>>> >>>>>
>>>>>> >>>>> Andreas Noack Jensen
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> --
>>>>>> >>>> Madeleine Udell
>>>>>> >>>> PhD Candidate in Computational and Mathematical Engineering
>>>>>> >>>> Stanford University
>>>>>> >>>> www.stanford.edu/~udell
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Med venlig hilsen
>>>>>> >>>
>>>>>> >>> Andreas Noack Jensen
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Madeleine Udell
>>>>>> >> PhD Candidate in Computational and Mathematical Engineering
>>>>>> >> Stanford University
>>>>>> >> www.stanford.edu/~udell
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Med venlig hilsen
>>>>>> >
>>>>>> > Andreas Noack Jensen
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Madeleine Udell
>>>>>> PhD Candidate in Computational and Mathematical Engineering
>>>>>> Stanford University
>>>>>> www.stanford.edu/~udell
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Madeleine Udell
>>>> PhD Candidate in Computational and Mathematical Engineering
>>>> Stanford University
>>>> www.stanford.edu/~udell
>>>>
>>>
>>
>>
>> --
>> Madeleine Udell
>> PhD Candidate in Computational and Mathematical Engineering
>> Stanford University
>> www.stanford.edu/~udell
>>
>
>
>
> --
> Madeleine Udell
> PhD Candidate in Computational and Mathematical Engineering
> Stanford University
> www.stanford.edu/~udell
>
