I think I see: you're suggesting a single loop that performs the
a*b*(c+d+e) operation element-wise, and this will prevent the allocation of
intermediate results. Is that right?
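Something like this is what I have in mind (a sketch with hypothetical names, dense vectors for simplicity, and current Julia syntax):

```julia
# Sketch: fuse a*b*(c+d+e) into a single loop so that no
# intermediate vectors are allocated; `out` is the only allocation.
function fused_combine(a, b, c, d, e)
    out = similar(a)
    for i in eachindex(a)
        out[i] = a[i] * b[i] * (c[i] + d[i] + e[i])
    end
    return out
end
```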

Yes, the sparsity patterns differ among the vectors I'm operating on,
but maybe I can adapt the code from sparsematrix.jl to my special case.
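As a rough sketch of the kind of adaptation I mean (hypothetical function and names, written against the current SparseArrays API rather than the one in the thread): since every term is multiplied by `step`, only the stored entries of `step` can produce nonzeros, so it suffices to walk those and look the other vectors up, whatever their sparsity patterns:

```julia
using SparseArrays

# Walk only the stored entries of `step`; every term below is
# multiplied by step[i], so no other index can contribute.
function lump_index(step, name, assignee, city, class)
    hits = Int[]
    rows, vals = findnz(step)
    for (i, s) in zip(rows, vals)
        # Scalar indexing of a SparseVector is a binary search
        # over its stored indices, so mismatched patterns are fine.
        if s * name[i] * (assignee[i] + city[i] + class[i]) != 0
            push!(hits, i)
        end
    end
    return hits
end
```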

On Tue, Nov 11, 2014 at 9:43 AM, Milan Bouchet-Valat <[email protected]>
wrote:

> > On Tuesday, November 11, 2014 at 04:52 -0800, Joshua Tokle wrote:
> > Thanks for your response. Because the operations are on sparse
> > matrices I'm pretty sure the arithmetic is already more optimized than
> > something I would write:
> >
> >
> https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
> > I did actually spend some time searching for sparse matrix arithmetic
> > algorithms, but I didn't come up with anything that seemed like it
> > would be an improvement.
> Of course the algorithm is optimized for sparse matrices, but every call
> to + or .* allocates a new temporary matrix, which is not efficient at all.
>
> > It seems that the issue may be with garbage collection. I'm going to
> > post a top-level reply with more on this.
> I'm not an expert on sparse matrices, but one efficient way of avoiding
> temporary copies and garbage collection is to manually unroll your loop,
> as I advised in my first message. Now I must admit it's more complex
> than it seems if the structure of nonzeros is not always the same in the
> various matrices you want to combine. Is that the case?
>
>
> Regards
>
> > On Monday, November 10, 2014 5:05:29 PM UTC-5, Milan Bouchet-Valat
> > wrote:
> >         On Monday, November 10, 2014 at 13:03 -0800, Joshua Tokle
> >         wrote:
> >         > Hello! I'm trying to replace an existing matlab code with
> >         julia and
> >         > I'm having trouble matching the performance of the original
> >         code. The
> >         > matlab code is here:
> >         >
> >         >
> >
> https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m
> >         >
> >         > The program clusters inventors from a database of patent
> >         applications.
> >         > The input data is a sparse boolean matrix (named XX in the
> >         script),
> >         > where each row defines an inventor and each column defines a
> >         feature.
> >         > For example, the jth column might correspond to a feature
> >         "first name
> >         > is John". If there is a 1 in XX[i, j], this means that
> >         inventor
> >         > i's first name is John. Given an inventor i, we find similar
> >         inventors
> >         > by identifying rows in the matrix that agree with XX[i, :]
> >         on a given
> >         > column and then applying element-wise boolean operations to
> >         the rows.
> >         > In the code, for a given value of `index`, C_lastname holds
> >         the unique
> >         > column in XX corresponding to a "last name" feature such
> >         that
> >         > XX[index, :] equals 1. C_firstname holds the unique column
> >         in XX
> >         > corresponding to a "first name" feature such that
> >         XX[index, :] equals
> >         > 1. And so on. The following code snippet finds all rows in
> >         the matrix
> >         > that agree with XX[index, :] on full name and one of patent
> >         > assignee name, inventor city, or patent class:
> >         >
> >         >     lump_index_2 = step & ((C_assignee | C_city | C_class))
> >         >
> >         > The `step` variable is an indicator that's used to prevent
> >         the same
> >         > inventors from being considered multiple times. My attempt
> >         at a
> >         > literal translation of this code to julia is here:
> >         >
> >         >
> >
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl
> >         >
> >         > The matrix X is of type SparseMatrixCSC{Int64, Int64}.
> >         Boolean
> >         > operations aren't supported for sparse matrices in julia, so
> >         I fake it
> >         > with integer arithmetic.  The line that corresponds to the
> >         matlab code
> >         > above is
> >         >
> >         >     lump_index_2 = find(step .* (C_name .* (C_assignee +
> >         C_city + C_class)))
> >         You should be able to get a speedup by replacing this line
> >         with an
> >         explicit `for` loop. First, you'll avoid memory allocation
> >         (one for each
> >         + or .* operation). Second, you'll be able to return as soon
> >         as the
> >         index is found, instead of computing the value for all
> >         elements (IIUC
> >         you're only looking for one index, right?).
> >
> >
> >         My two cents
> >
> >         > The reason I grouped it this way is that initially `step`
> >         will be a
> >         > "sparse" vector of all 1's, and I thought it might help to
> >         do the
> >         > truly sparse arithmetic first.
> >         >
> >         > I've been testing this code on Windows Server 2008. The
> >         test data
> >         > contains 45,763 inventors and 274,578 possible features (in
> >         other
> >         > words, XX is a 45,763 x 274,578 sparse matrix). The matlab
> >         program
> >         > consistently takes about 70 seconds to run on this data. The
> >         julia
> >         > version shows a lot of variation: it's taken as little as 60
> >         seconds
> >         > and as much as 10 minutes. However, most runs take around
> >         3.5 to 4
> >         > minutes. I pasted one output from the sampling profiler here
> >         [1]. If
> >         > I'm reading this correctly, it looks like the program is
> >         spending most
> >         > of its time performing element-wise multiplication of the
> >         indicator
> >         > vectors I described above.
> >         >
> >         > I would be grateful for any suggestions that would bring
> >         the
> >         > performance of the julia program in line with the matlab
> >         version. I've
> >         > heard that the last time the matlab code was run on the full
> >         data set
> >         > it took a couple of days, so a slow-down of 3-4x is a
> >         > significant burden.
> >         > I did attempt to write a more idiomatic julia version using
> >         Dicts and
> >         > Sets, but it's slower than the version that uses sparse
> >         matrix
> >         > operations:
> >         >
> >         >
> >
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl
> >         >
> >         > Thank you!
> >         > Josh
> >         >
> >         >
> >         > [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5
> >         >
> >         >
> >
>
>
