Very, very cool, Madeleine!

On Thursday, February 13, 2014 6:49:20 PM UTC-5, Madeleine Udell wrote:
>
> Thanks for all the advice, everyone. I've just finished a parallel sparse 
> matrix vector multiplication library written in straight julia for shared 
> memory machines, using Amit Murthy's SharedArrays. You can find it at 
> https://github.com/madeleineudell/ParallelSparseMatMul.jl 
>
> For a matrix with 1E9 nonzeros, its performance on 6 processors is the 
> same as MKL's, and it retains linear parallel scaling up to the highest 
> count I've tested (40 threads).
>
> (serial julia)
> julia> @time S'*x;
> elapsed time: 17.63098703 seconds (8410364 bytes allocated)
>
> (mkl, 6 threads)
> julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
> elapsed time: 4.334168257 seconds (48 bytes allocated)
>
> (ParallelSparseMatMul, 6 threads)
> julia> @time y = A'*x;
> elapsed time: 4.18557078 seconds (1858156 bytes allocated)
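>
> If you'd like to try it, here is a minimal usage sketch (treat it as 
> illustrative rather than definitive; share() is the constructor the 
> package README describes for copying a SparseMatrixCSC into shared 
> memory):
>
> julia> addprocs(5)                 # 6 local processes in total
> julia> using ParallelSparseMatMul
> julia> S = share(A)                # shared-memory copy of a sparse matrix A
> julia> y = S*x;                    # parallel A*x
> julia> z = S'*x;                   # parallel A'*x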
>
>
>
>
>
> On Tue, Feb 11, 2014 at 1:44 PM, Madeleine Udell <[email protected]> wrote:
>
>> Thanks, Jake. It seems I'm still not able to go above 6 threads, but that 
>> may be an issue deep in my MKL install.
>>
>> I've begun implementing a shared sparse matrix class in native Julia that 
>> seems to scale better. The code is attached to this email. How can I 
>> overload A'*b for my SharedBilinearOperator class, so that I can call 
>> At_mul_B on a stored copy of A', rather than computing A' each time I want 
>> to multiply by it? I.e., can I write something like
>>
>> ('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)
>>
>> ?
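>>
>> (One approach that appears to work: '* itself isn't an overloadable 
>> operator, but in Julia 0.3 the expression L'*x is lowered to 
>> Ac_mul_B(L, x), so that method can be overloaded instead. A sketch, with 
>> field names assumed for illustration:)
>>
>> import Base: *, Ac_mul_B, At_mul_B
>>
>> type SharedBilinearOperator
>>     A   # the operator itself
>>     AT  # stored copy of A', computed once at construction
>> end
>> SharedBilinearOperator(A) = SharedBilinearOperator(A, A')
>>
>> *(L::SharedBilinearOperator, x) = L.A*x
>> # L'*x lowers to Ac_mul_B(L, x), so use the stored transpose:
>> Ac_mul_B(L::SharedBilinearOperator, x) = L.AT*x
>> At_mul_B(L::SharedBilinearOperator, x) = L.AT*x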
>>
>>
>> On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <[email protected]> wrote:
>>
>>> Hey Madeleine,
>>>
>>> First I would check that your global environment variables 
>>> <http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm> 
>>> are set up correctly. If you want to set the number of threads 
>>> programmatically:
>>>
>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>    _ _   _| |_  __ _   |  Type "help()" to list help topics
>>>   | | | | | | |/ _` |  |
>>>   | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
>>>  _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
>>> |__/                   |  x86_64-unknown-linux-gnu
>>>
>>> julia> Base.blas_vendor()
>>> :mkl
>>>
>>> julia> Base.blas_set_num_threads(12)
>>>
>>>
>>> You can see the relevant code here: 
>>> https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
>>> It's always worth a quick search through the code base to figure out 
>>> what is going on behind the scenes.
>>>
>>> Hope that helps!
>>> Jake
>>>
>>> On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:
>>>>
>>>> Jake, thanks for the reference 
>>>> <http://software.intel.com/en-us/forums/topic/294954>; I have 32 
>>>> hyperthreaded cores, so there's definitely something else going on to 
>>>> limit me to 6, in addition to not exploiting hyperthreading.
>>>>
>>>> Perhaps I need to call something like omp_set_num_threads()? But there 
>>>> doesn't seem to be a function by this name in the libmkl_rt library.
>>>>
>>>> julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
>>>> ERROR: ccall: could not find function omp_set_num_threads in library 
>>>> libmkl_rt
>>>>  in anonymous at no file
>>>>
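>>>> A workaround that may be worth trying: libmkl_rt should export MKL's 
>>>> own service functions (mkl_set_num_threads and mkl_get_max_threads 
>>>> from the MKL C API), so a sketch like the following could work even 
>>>> though the OpenMP symbol is missing:
>>>>
>>>> julia> ccall((:mkl_set_num_threads, Base.libblas_name), Void, (Cint,), 32)
>>>>
>>>> julia> ccall((:mkl_get_max_threads, Base.libblas_name), Cint, ())
>>>>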
>>>>
>>>> On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <[email protected]> wrote:
>>>>
>>>>> Are those hyper-threaded "cores"?  If so, check Intel MKL's 
>>>>> documentation on hyper-threading.
>>>>>
>>>>> -Best
>>>>> Jake 
>>>>>
>>>>> On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:
>>>>>
>>>>>> It looks like only 6 threads are being used when I use MKL from 
>>>>>> Julia. If I do blas_set_num_threads(n), then watching top, I see 
>>>>>> julia running at min(n,6)*100% CPU. Any idea why this would be, or 
>>>>>> how to get MKL to use more threads? I'm not sure if this is an issue 
>>>>>> in Julia or with my installation of MKL. 
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen 
>>>>>> <[email protected]> wrote: 
>>>>>> > You are welcome. Good to hear that it worked. 
>>>>>> > 
>>>>>> > 
>>>>>> > 2014-02-10 22:35 GMT+01:00 Madeleine Udell <[email protected]>: 
>>>>>> > 
>>>>>> >> fantastic, thank you. I now see Base.libblas_name = "libmkl_rt", and 
>>>>>> >> am able to run the following test successfully: 
>>>>>> >> 
>>>>>> >> transa = 'N'::Base.LinAlg.BlasChar # multiply by A, not A' 
>>>>>> >> matdescra = "GXXF" # G = general, X = ignored, F = one-based indexing 
>>>>>> >> m,n = 50,100 
>>>>>> >> A = sprand(m,n,.01) 
>>>>>> >> y = zeros(m) 
>>>>>> >> x = rand(n) 
>>>>>> >> alpha = 1.0 
>>>>>> >> beta = 1.0 
>>>>>> >> 
>>>>>> >> Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y) 
>>>>>> >> y_builtin = A*x 
>>>>>> >> 
>>>>>> >> julia> y==y_builtin 
>>>>>> >> true 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen 
>>>>>> >> <[email protected]> wrote: 
>>>>>> >>> 
>>>>>> >>> Hi Madeleine 
>>>>>> >>> 
>>>>>> >>> When compiling julia with MKL you'll have to do make cleanall as 
>>>>>> >>> well as rebuild ARPACK and SuiteSparse, i.e. make -C deps 
>>>>>> >>> distclean-arpack and make -C deps distclean-suitesparse. It is also 
>>>>>> >>> easier to create a Make.user where you set USE_MKL=1 and MKLROOT to 
>>>>>> >>> the location of your MKL library files, e.g. /opt/intel/mkl. The 
>>>>>> >>> arguments are explained here 
>>>>>> >>> 
>>>>>> >>> http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2 
>>>>>> >>> 
>>>>>> >>> but transa determines if the operation is Ax or A'x, and matdescra 
>>>>>> >>> carries information about the structure of the matrix, e.g. if it 
>>>>>> >>> is triangular. When you have succeeded in compiling julia with MKL, 
>>>>>> >>> the libblas variable should just be Base.libblas_name. 
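>>>>>> >>> 
>>>>>> >>> For concreteness, a minimal Make.user along the lines described 
>>>>>> >>> above might contain just (adjust the path to your install): 
>>>>>> >>> 
>>>>>> >>> USE_MKL = 1 
>>>>>> >>> MKLROOT = /opt/intel/mkl 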
>>>>>> >>> 
>>>>>> >>> 
>>>>>> >>> 2014-02-10 20:37 GMT+01:00 Madeleine Udell <[email protected]>: 
>>>>>>
>>>>>> >>> 
>>>>>> >>>> I'm having trouble using MKL in julia. I changed Make.inc so that 
>>>>>> >>>> USE_MKL = 1, but when I make and run julia, I find that 
>>>>>> >>>> Base.libblas_name = "libopenblas". Is this expected? I would have 
>>>>>> >>>> thought it would be e.g. "libmkl_core". 
>>>>>> >>>> 
>>>>>> >>>> Andreas, I found your wrappers for MKL in this PR. I've not used 
>>>>>> >>>> MKL before, so I don't understand the call signature of those 
>>>>>> >>>> functions in order to call MKL directly. Any chance you could 
>>>>>> >>>> explain what transa::BlasChar and matdescra::ASCIIString are in 
>>>>>> >>>> the following function, and which MKL library is expected in the 
>>>>>> >>>> libblas variable? I see many .so files in the lib/intel64 
>>>>>> >>>> directory of my MKL installation, and I'm not sure which one to 
>>>>>> >>>> point julia to. 
>>>>>> >>>> 
>>>>>> >>>> function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString, 
>>>>>> >>>>                 A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T}, 
>>>>>> >>>>                 β::$T, y::StridedVector{$T}) 
>>>>>> >>>>     length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))")) 
>>>>>> >>>>     length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))")) 
>>>>>> >>>>     ccall(($(string(mv)), libblas), Void, 
>>>>>> >>>>           (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T}, 
>>>>>> >>>>            Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt}, 
>>>>>> >>>>            Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}), 
>>>>>> >>>>           &transa, &A.m, &A.n, &α, 
>>>>>> >>>>           matdescra, A.nzval, A.rowval, pointer(A.colptr, 1), 
>>>>>> >>>>           pointer(A.colptr, 2), x, &β, y) 
>>>>>> >>>>     return y 
>>>>>> >>>> end 
>>>>>> >>>> 
>>>>>> >>>> Thanks for your help! 
>>>>>> >>>> Madeleine 
>>>>>> >>>> 
>>>>>> >>>> 
>>>>>> >>>> On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen 
>>>>>> >>>> <[email protected]> wrote: 
>>>>>> >>>>> 
>>>>>> >>>>> A*b will not call MKL when A is sparse. There has been some 
>>>>>> >>>>> discussion about making an MKL package that overwrites 
>>>>>> >>>>> A_mul_B(Matrix,Vector) with the MKL versions, and I actually 
>>>>>> >>>>> wrote wrappers for the sparse MKL subroutines in the fall for 
>>>>>> >>>>> the same reason. 
>>>>>> >>>>> 
>>>>>> >>>>> 
>>>>>> >>>>> 2014-02-05 Madeleine Udell <[email protected]>: 
>>>>>> >>>>> 
>>>>>> >>>>>> Miles, you're right that writing sparse matrix-vector products 
>>>>>> >>>>>> in native Julia probably won't be the best idea given Julia's 
>>>>>> >>>>>> model of parallelism. That's why I'm interested in calling an 
>>>>>> >>>>>> outside library like PETSc. 
>>>>>> >>>>>> 
>>>>>> >>>>>> I see it's possible to link Julia with MKL. I haven't tried 
>>>>>> >>>>>> this yet, but if I do, will A*b (where A is sparse) call MKL 
>>>>>> >>>>>> to perform the matrix-vector product? 
>>>>>> >>>>>> 
>>>>>> >>>>>> 
>>>>>> >>>>>> On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin 
>>>>>> >>>>>> <[email protected]> wrote: 
>>>>>> >>>>>>> 
>>>>>> >>>>>>> Memory access is typically a significant bottleneck in sparse 
>>>>>> >>>>>>> mat-vec, so unfortunately I'm skeptical that one could achieve 
>>>>>> >>>>>>> good performance using Julia's current distributed-memory 
>>>>>> >>>>>>> approach on a multicore machine. This really calls for 
>>>>>> >>>>>>> something like OpenMP. 
>>>>>> >>>>>>> 
>>>>>> >>>>>>> 
>>>>>> >>>>>>> On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine 
>>>>>> >>>>>>> Udell wrote: 
>>>>>> >>>>>>>> 
>>>>>> >>>>>>>> I'm developing an iterative optimization algorithm in Julia, 
>>>>>> >>>>>>>> along the lines of other contributions to the Iterative 
>>>>>> >>>>>>>> Solvers project or Krylov Subspace module, whose only 
>>>>>> >>>>>>>> computationally intensive step is computing A*b or A'*b. I 
>>>>>> >>>>>>>> would like to parallelize the method by using a parallel 
>>>>>> >>>>>>>> sparse matrix-vector multiply. Is there a standard backend 
>>>>>> >>>>>>>> matrix-vector multiply that's recommended in Julia if I'm 
>>>>>> >>>>>>>> targeting a shared-memory computer with a large number of 
>>>>>> >>>>>>>> processors? Similarly, is there a recommended backend for 
>>>>>> >>>>>>>> targeting a cluster? My matrices can easily reach 10 million 
>>>>>> >>>>>>>> rows by 1 million columns, with sparsity anywhere from .01% 
>>>>>> >>>>>>>> to problems that are nearly diagonal. 
>>>>>> >>>>>>>> 
>>>>>> >>>>>>>> I've seen many posts talking about integrating PETSc as a 
>>>>>> >>>>>>>> backend for this purpose, but it looks like the project has 
>>>>>> >>>>>>>> stalled; the last commits I see are from a year ago. I'm also 
>>>>>> >>>>>>>> interested in other backends, e.g. Spark, SciDB, etc. 
>>>>>> >>>>>>>> 
>>>>>> >>>>>>>> I'm more interested in solving sparse problems, but as a side 
>>>>>> >>>>>>>> note, the built-in BLAS acceleration, enabled by changing the 
>>>>>> >>>>>>>> number of threads with blas_set_num_threads, works OK for 
>>>>>> >>>>>>>> dense problems using a moderate number of processors. I 
>>>>>> >>>>>>>> wonder why the number of threads isn't set higher than one by 
>>>>>> >>>>>>>> default, for example, using as many as nprocs() cores? 
>>>>>> >>>>>> 
>>>>>> >>>>>> 
>>>>>> >>>>>> 
>>>>>> >>>>>> 
>>>>>> >>>>>> -- 
>>>>>> >>>>>> Madeleine Udell 
>>>>>> >>>>>> PhD Candidate in Computational and Mathematical Engineering 
>>>>>> >>>>>> Stanford University 
>>>>>> >>>>>> www.stanford.edu/~udell 
>>>>>> >>>>> 
>>>>>> >>>>> 
>>>>>> >>>>> 
>>>>>> >>>>> 
>>>>>> >>>>> -- 
>>>>>> >>>>> Kind regards 
>>>>>> >>>>> 
>>>>>> >>>>> Andreas Noack Jensen 
>>>>>> >>>> 
>>>>>> >>>> 
>>>>>> >>>> 
>>>>>> >>>> 
>>>>>> >>>> -- 
>>>>>> >>>> Madeleine Udell 
>>>>>> >>>> PhD Candidate in Computational and Mathematical Engineering 
>>>>>> >>>> Stanford University 
>>>>>> >>>> www.stanford.edu/~udell 
>>>>>> >>> 
>>>>>> >>> 
>>>>>> >>> 
>>>>>> >>> 
>>>>>> >>> -- 
>>>>>> >>> Kind regards 
>>>>>> >>> 
>>>>>> >>> Andreas Noack Jensen 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> -- 
>>>>>> >> Madeleine Udell 
>>>>>> >> PhD Candidate in Computational and Mathematical Engineering 
>>>>>> >> Stanford University 
>>>>>> >> www.stanford.edu/~udell 
>>>>>> > 
>>>>>> > 
>>>>>> > 
>>>>>> > 
>>>>>> > -- 
>>>>>> > Kind regards 
>>>>>> > 
>>>>>> > Andreas Noack Jensen 
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Madeleine Udell 
>>>>>> PhD Candidate in Computational and Mathematical Engineering 
>>>>>> Stanford University 
>>>>>> www.stanford.edu/~udell 
>>>>>>
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Madeleine Udell
>>>> PhD Candidate in Computational and Mathematical Engineering
>>>> Stanford University
>>>> www.stanford.edu/~udell
>>>>  
>>>
>>
>>
>> -- 
>> Madeleine Udell
>> PhD Candidate in Computational and Mathematical Engineering
>> Stanford University
>> www.stanford.edu/~udell
>>  
>
>
>
> -- 
> Madeleine Udell
> PhD Candidate in Computational and Mathematical Engineering
> Stanford University
> www.stanford.edu/~udell
>  
