Le mardi 11 novembre 2014 à 09:54 -0500, Joshua Tokle a écrit : > I think I see: you're suggesting a single loop that performs the > a*b*(c+d+e) operation element-wise, and this will prevent the > allocation of intermediate results. Is that right? Yes.
> Yes, the sparsity pattern will differ between the vectors I'm > operating on, but maybe I can adapt the code from sparsematrix.jl to > my special case. Indeed, this makes my proposition more challenging. A general algorithm to combine several sparse matrices would be very useful. Regards > On Tue, Nov 11, 2014 at 9:43 AM, Milan Bouchet-Valat > <[email protected]> wrote: > Le mardi 11 novembre 2014 à 04:52 -0800, Joshua Tokle a > écrit : > > Thanks for your response. Because the operations are on > sparse > > matrices I'm pretty sure the arithmetic is already more > optimized than > > something I would write: > > > > > > https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530 > > I did actually spend some time searching for sparse matrix > arithmetic > > algorithms, but I didn't come up with anything that seemed > like it > > would be an improvement. > Of course the algorithm is optimized for sparse matrices, but > every call > to + or .* creates a copy of the matrix, which is not > efficient at all. > > > It seems that the issue may be with garbage collection. I'm > going to > > post a top-level reply with more on this. > I'm not an expert of sparse matrices, but one efficient way of > avoid > temporary copies and garbage collection is to manually unroll > your loop, > as I advised in my first message. Now I must admit it's more > complex > than it seems if the structure of nonzeros is not always the > same in the > various matrices you want to combine. Is that the case? > > > Regards > > > Regards > > > On Monday, November 10, 2014 5:05:29 PM UTC-5, Milan > Bouchet-Valat > > wrote: > > Le lundi 10 novembre 2014 à 13:03 -0800, Joshua > Tokle a > > écrit : > > > Hello! I'm trying to replace an existing matlab > code with > > julia and > > > I'm having trouble matching the performance of the > original > > code. The > > > matlab code is here: > > > > > > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m > > > > > > The program clusters inventors from a database of > patent > > applications. > > > The input data is a sparse boolean matrix (named > XX in the > > script), > > > where each row defines an inventor and each column > defines a > > feature. > > > For example, the jth column might correspond to a > feature > > "first name > > > is John". If there is a 1 in the XX[i, j], this > means that > > inventor > > > i's first name is John. Given an inventor i, we > find similar > > inventors > > > by identifying rows in the matrix that agree with > XX[i, :] > > on a given > > > column and then applying element-wise boolean > operations to > > the rows. > > > In the code, for a given value of `index`, > C_lastname holds > > the unique > > > column in XX corresponding to a "last name" > feature such > > that > > > XX[index, :] equals 1. C_firstname holds the > unique column > > in XX > > > corresponding to a "first name" feature such that > > XX[index, :] equals > > > 1. And so on. The following code snippet finds all > rows in > > the matrix > > > that agree with XX[index, :] on full name and one > of patent > > assignee > > > name, inventory city, or patent class: > > > > > > lump_index_2 = step & ((C_assignee | C_city | > C_class)) > > > > > > The `step` variable is an indicator that's used to > prevent > > the same > > > inventors from being considered multiple times. My > attempt > > at a > > > literal translation of this code to julia is here: > > > > > > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl > > > > > > The matrix X is of type SparseMatrixCSC{Int64, > Int64}. > > Boolean > > > operations aren't supported for sparse matrices in > julia, so > > I fake it > > > with integer arithmetic. The line that > corresponds to the > > matlab code > > > above is > > > > > > lump_index_2 = find(step .* (C_name .* > (C_assignee + > > C_city + C_class))) > > You should be able to get a speedup by replacing > this line > > with an > > explicit `for` loop. First, you'll avoid memory > allocation > > (one for each > > + or .* operation). Second, you'll be able to return > as soon > > as the > > index is found, instead of computing the value for > all > > elements (IIUC > > you're only looking for one index, right?). > > > > > > My two cents > > > > > The reason I grouped it this way is that initially > `step` > > will be a > > > "sparse" vector of all 1's, and I thought it might > help to > > do the > > > truly sparse arithmetic first. > > > > > > I've been testing this code on a Windows 2008 > Server. The > > test data > > > contains 45,763 inventors and 274,578 possible > features (in > > other > > > words, XX is an 45,763 x 274,58 sparse matrix). > The matlab > > program > > > consistently takes about 70 seconds to run on this > data. The > > julia > > > version shows a lot of variation: it's taken as > little as 60 > > seconds > > > and as much as 10 minutes. However, most runs take > around > > 3.5 to 4 > > > minutes. I pasted one output from the sampling > profiler here > > [1]. If > > > I'm reading this correctly, it looks like the > program is > > spending most > > > of its time performing element-wise multiplication > of the > > indicator > > > vectors I described above. > > > > > > I would be grateful for any suggestions that would > bring > > the > > > performance of the julia program in line with the > matlab > > version. I've > > > heard that the last time the matlab code was run > on the full > > data set > > > it took a couple days, so a slow-down of 3-4x is a > > signficant burden. > > > I did attempt to write a more idiomatic julia > version using > > Dicts and > > > Sets, but it's slower than the version that uses > sparse > > matrix > > > operations: > > > > > > > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl > > > > > > Thank you! > > > Josh > > > > > > > > > [1] > https://gist.github.com/jotok/6b469a1dc0ff9529caf5 > > > > > > > > > > > >
