Le mardi 11 novembre 2014 à 09:54 -0500, Joshua Tokle a écrit :
> I think I see: you're suggesting a single loop that performs the
> a*b*(c+d+e) operation element-wise, and this will prevent the
> allocation of intermediate results. Is that right?
Yes.

> Yes, the sparsity pattern will differ between the vectors I'm
> operating on, but maybe I can adapt the code from sparsematrix.jl to
> my special case.
Indeed, this makes my proposition more challenging. A general algorithm
to combine several sparse matrices would be very useful.


Regards

> On Tue, Nov 11, 2014 at 9:43 AM, Milan Bouchet-Valat
> <[email protected]> wrote:
>         Le mardi 11 novembre 2014 à 04:52 -0800, Joshua Tokle a
>         écrit :
>         > Thanks for your response. Because the operations are on
>         sparse
>         > matrices I'm pretty sure the arithmetic is already more
>         optimized than
>         > something I would write:
>         >
>         >
>         
> https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
>         > I did actually spend some time searching for sparse matrix
>         arithmetic
>         > algorithms, but I didn't come up with anything that seemed
>         like it
>         > would be an improvement.
>         Of course the algorithm is optimized for sparse matrices, but
>         every call
>         to + or .* creates a copy of the matrix, which is not
>         efficient at all.
>         
>         > It seems that the issue may be with garbage collection. I'm
>         going to
>         > post a top-level reply with more on this.
>         I'm not an expert of sparse matrices, but one efficient way of
>         avoid
>         temporary copies and garbage collection is to manually unroll
>         your loop,
>         as I advised in my first message. Now I must admit it's more
>         complex
>         than it seems if the structure of nonzeros is not always the
>         same in the
>         various matrices you want to combine. Is that the case?
>         
>         
>         Regards
>         
>         
>         Regards
>         
>         > On Monday, November 10, 2014 5:05:29 PM UTC-5, Milan
>         Bouchet-Valat
>         > wrote:
>         >         Le lundi 10 novembre 2014 à 13:03 -0800, Joshua
>         Tokle a
>         >         écrit :
>         >         > Hello! I'm trying to replace an existing matlab
>         code with
>         >         julia and
>         >         > I'm having trouble matching the performance of the
>         original
>         >         code. The
>         >         > matlab code is here:
>         >         >
>         >         >
>         >
>          https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m
>         >         >
>         >         > The program clusters inventors from a database of
>         patent
>         >         applications.
>         >         > The input data is a sparse boolean matrix (named
>         XX in the
>         >         script),
>         >         > where each row defines an inventor and each column
>         defines a
>         >         feature.
>         >         > For example, the jth column might correspond to a
>         feature
>         >         "first name
>         >         > is John". If there is a 1 in the XX[i, j], this
>         means that
>         >         inventor
>         >         > i's first name is John. Given an inventor i, we
>         find similar
>         >         inventors
>         >         > by identifying rows in the matrix that agree with
>         XX[i, :]
>         >         on a given
>         >         > column and then applying element-wise boolean
>         operations to
>         >         the rows.
>         >         > In the code, for a given value of `index`,
>         C_lastname holds
>         >         the unique
>         >         > column in XX corresponding to a "last name"
>         feature such
>         >         that
>         >         > XX[index, :] equals 1. C_firstname holds the
>         unique column
>         >         in XX
>         >         > corresponding to a "first name" feature such that
>         >         XX[index, :] equals
>         >         > 1. And so on. The following code snippet finds all
>         rows in
>         >         the matrix
>         >         > that agree with XX[index, :] on full name and one
>         of patent
>         >         assignee
>         >         > name, inventory city, or patent class:
>         >         >
>         >         >     lump_index_2 = step & ((C_assignee | C_city |
>         C_class))
>         >         >
>         >         > The `step` variable is an indicator that's used to
>         prevent
>         >         the same
>         >         > inventors from being considered multiple times. My
>         attempt
>         >         at a
>         >         > literal translation of this code to julia is here:
>         >         >
>         >         >
>         >
>          https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl
>         >         >
>         >         > The matrix X is of type SparseMatrixCSC{Int64,
>         Int64}.
>         >         Boolean
>         >         > operations aren't supported for sparse matrices in
>         julia, so
>         >         I fake it
>         >         > with integer arithmetic.  The line that
>         corresponds to the
>         >         matlab code
>         >         > above is
>         >         >
>         >         >     lump_index_2 = find(step .* (C_name .*
>         (C_assignee +
>         >         C_city + C_class)))
>         >         You should be able to get a speedup by replacing
>         this line
>         >         with an
>         >         explicit `for` loop. First, you'll avoid memory
>         allocation
>         >         (one for each
>         >         + or .* operation). Second, you'll be able to return
>         as soon
>         >         as the
>         >         index is found, instead of computing the value for
>         all
>         >         elements (IIUC
>         >         you're only looking for one index, right?).
>         >
>         >
>         >         My two cents
>         >
>         >         > The reason I grouped it this way is that initially
>         `step`
>         >         will be a
>         >         > "sparse" vector of all 1's, and I thought it might
>         help to
>         >         do the
>         >         > truly sparse arithmetic first.
>         >         >
>         >         > I've been testing this code on a Windows 2008
>         Server. The
>         >         test data
>         >         > contains 45,763 inventors and 274,578 possible
>         features (in
>         >         other
>         >         > words, XX is an 45,763 x 274,58 sparse matrix).
>         The matlab
>         >         program
>         >         > consistently takes about 70 seconds to run on this
>         data. The
>         >         julia
>         >         > version shows a lot of variation: it's taken as
>         little as 60
>         >         seconds
>         >         > and as much as 10 minutes. However, most runs take
>         around
>         >         3.5 to 4
>         >         > minutes. I pasted one output from the sampling
>         profiler here
>         >         [1]. If
>         >         > I'm reading this correctly, it looks like the
>         program is
>         >         spending most
>         >         > of its time performing element-wise multiplication
>         of the
>         >         indicator
>         >         > vectors I described above.
>         >         >
>         >         > I would be grateful for any suggestions that would
>         bring
>         >         the
>         >         > performance of the julia program in line with the
>         matlab
>         >         version. I've
>         >         > heard that the last time the matlab code was run
>         on the full
>         >         data set
>         >         > it took a couple days, so a slow-down of 3-4x is a
>         >         signficant burden.
>         >         > I did attempt to write a more idiomatic julia
>         version using
>         >         Dicts and
>         >         > Sets, but it's slower than the version that uses
>         sparse
>         >         matrix
>         >         > operations:
>         >         >
>         >         >
>         >
>          
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl
>         >         >
>         >         > Thank you!
>         >         > Josh
>         >         >
>         >         >
>         >         > [1]
>         https://gist.github.com/jotok/6b469a1dc0ff9529caf5
>         >         >
>         >         >
>         >
>         
>         
> 
> 

Reply via email to