Le mardi 11 novembre 2014 à 04:52 -0800, Joshua Tokle a écrit :
> Thanks for your response. Because the operations are on sparse
> matrices I'm pretty sure the arithmetic is already more optimized than
> something I would write:
> 
> https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530
> I did actually spend some time searching for sparse matrix arithmetic
> algorithms, but I didn't come up with anything that seemed like it
> would be an improvement.
Of course the algorithm is optimized for sparse matrices, but every call
to + or .* creates a copy of the matrix, which is not efficient at all.

> It seems that the issue may be with garbage collection. I'm going to
> post a top-level reply with more on this.
I'm not an expert of sparse matrices, but one efficient way of avoid
temporary copies and garbage collection is to manually unroll your loop,
as I advised in my first message. Now I must admit it's more complex
than it seems if the structure of nonzeros is not always the same in the
various matrices you want to combine. Is that the case?


Regards


Regards

> On Monday, November 10, 2014 5:05:29 PM UTC-5, Milan Bouchet-Valat
> wrote:
>         Le lundi 10 novembre 2014 à 13:03 -0800, Joshua Tokle a
>         écrit : 
>         > Hello! I'm trying to replace an existing matlab code with
>         julia and 
>         > I'm having trouble matching the performance of the original
>         code. The 
>         > matlab code is here: 
>         > 
>         >
>         https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m 
>         > 
>         > The program clusters inventors from a database of patent
>         applications. 
>         > The input data is a sparse boolean matrix (named XX in the
>         script), 
>         > where each row defines an inventor and each column defines a
>         feature. 
>         > For example, the jth column might correspond to a feature
>         "first name 
>         > is John". If there is a 1 in the XX[i, j], this means that
>         inventor 
>         > i's first name is John. Given an inventor i, we find similar
>         inventors 
>         > by identifying rows in the matrix that agree with XX[i, :]
>         on a given 
>         > column and then applying element-wise boolean operations to
>         the rows. 
>         > In the code, for a given value of `index`, C_lastname holds
>         the unique 
>         > column in XX corresponding to a "last name" feature such
>         that 
>         > XX[index, :] equals 1. C_firstname holds the unique column
>         in XX 
>         > corresponding to a "first name" feature such that
>         XX[index, :] equals 
>         > 1. And so on. The following code snippet finds all rows in
>         the matrix 
>         > that agree with XX[index, :] on full name and one of patent
>         assignee 
>         > name, inventory city, or patent class: 
>         > 
>         >     lump_index_2 = step & ((C_assignee | C_city | C_class)) 
>         > 
>         > The `step` variable is an indicator that's used to prevent
>         the same 
>         > inventors from being considered multiple times. My attempt
>         at a 
>         > literal translation of this code to julia is here: 
>         > 
>         >
>         https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl 
>         > 
>         > The matrix X is of type SparseMatrixCSC{Int64, Int64}.
>         Boolean 
>         > operations aren't supported for sparse matrices in julia, so
>         I fake it 
>         > with integer arithmetic.  The line that corresponds to the
>         matlab code 
>         > above is 
>         > 
>         >     lump_index_2 = find(step .* (C_name .* (C_assignee +
>         C_city + C_class))) 
>         You should be able to get a speedup by replacing this line
>         with an 
>         explicit `for` loop. First, you'll avoid memory allocation
>         (one for each 
>         + or .* operation). Second, you'll be able to return as soon
>         as the 
>         index is found, instead of computing the value for all
>         elements (IIUC 
>         you're only looking for one index, right?). 
>         
>         
>         My two cents 
>         
>         > The reason I grouped it this way is that initially `step`
>         will be a 
>         > "sparse" vector of all 1's, and I thought it might help to
>         do the 
>         > truly sparse arithmetic first. 
>         > 
>         > I've been testing this code on a Windows 2008 Server. The
>         test data 
>         > contains 45,763 inventors and 274,578 possible features (in
>         other 
>         > words, XX is an 45,763 x 274,58 sparse matrix). The matlab
>         program 
>         > consistently takes about 70 seconds to run on this data. The
>         julia 
>         > version shows a lot of variation: it's taken as little as 60
>         seconds 
>         > and as much as 10 minutes. However, most runs take around
>         3.5 to 4 
>         > minutes. I pasted one output from the sampling profiler here
>         [1]. If 
>         > I'm reading this correctly, it looks like the program is
>         spending most 
>         > of its time performing element-wise multiplication of the
>         indicator 
>         > vectors I described above. 
>         > 
>         > I would be grateful for any suggestions that would bring
>         the 
>         > performance of the julia program in line with the matlab
>         version. I've 
>         > heard that the last time the matlab code was run on the full
>         data set 
>         > it took a couple days, so a slow-down of 3-4x is a
>         signficant burden. 
>         > I did attempt to write a more idiomatic julia version using
>         Dicts and 
>         > Sets, but it's slower than the version that uses sparse
>         matrix 
>         > operations: 
>         > 
>         >
>         
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl 
>         > 
>         > Thank you! 
>         > Josh 
>         > 
>         > 
>         > [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5 
>         > 
>         > 
>         

Reply via email to