I should have posted the output from @time in the initial post. The version of the code using sparse matrices is reporting 87% gc time. I thought I could fix the problem in the second version of the code that uses Sets instead of sparse matrices. Indeed, v2 only reports about 20% gc time, but the overall run time is still worse (about 1.6x the sparse matrix version).
In the sparse matrix version, several sparse vectors are being created in each loop iteration -- I suspect this is driving the garbage collection. If I was using dense arrays I would try preallocating buffers, but I don't know how to do this in a sparse setting. Are there any tricks for optimizing the create/destruction of ephemeral structs? On Monday, November 10, 2014 5:05:29 PM UTC-5, Milan Bouchet-Valat wrote: > > Le lundi 10 novembre 2014 à 13:03 -0800, Joshua Tokle a écrit : > > Hello! I'm trying to replace an existing matlab code with julia and > > I'm having trouble matching the performance of the original code. The > > matlab code is here: > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m > > > > The program clusters inventors from a database of patent applications. > > The input data is a sparse boolean matrix (named XX in the script), > > where each row defines an inventor and each column defines a feature. > > For example, the jth column might correspond to a feature "first name > > is John". If there is a 1 in the XX[i, j], this means that inventor > > i's first name is John. Given an inventor i, we find similar inventors > > by identifying rows in the matrix that agree with XX[i, :] on a given > > column and then applying element-wise boolean operations to the rows. > > In the code, for a given value of `index`, C_lastname holds the unique > > column in XX corresponding to a "last name" feature such that > > XX[index, :] equals 1. C_firstname holds the unique column in XX > > corresponding to a "first name" feature such that XX[index, :] equals > > 1. And so on. The following code snippet finds all rows in the matrix > > that agree with XX[index, :] on full name and one of patent assignee > > name, inventory city, or patent class: > > > > lump_index_2 = step & ((C_assignee | C_city | C_class)) > > > > The `step` variable is an indicator that's used to prevent the same > > inventors from being considered multiple times. My attempt at a > > literal translation of this code to julia is here: > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl > > > > The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean > > operations aren't supported for sparse matrices in julia, so I fake it > > with integer arithmetic. The line that corresponds to the matlab code > > above is > > > > lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + > C_class))) > You should be able to get a speedup by replacing this line with an > explicit `for` loop. First, you'll avoid memory allocation (one for each > + or .* operation). Second, you'll be able to return as soon as the > index is found, instead of computing the value for all elements (IIUC > you're only looking for one index, right?). > > > My two cents > > > The reason I grouped it this way is that initially `step` will be a > > "sparse" vector of all 1's, and I thought it might help to do the > > truly sparse arithmetic first. > > > > I've been testing this code on a Windows 2008 Server. The test data > > contains 45,763 inventors and 274,578 possible features (in other > > words, XX is an 45,763 x 274,58 sparse matrix). The matlab program > > consistently takes about 70 seconds to run on this data. The julia > > version shows a lot of variation: it's taken as little as 60 seconds > > and as much as 10 minutes. However, most runs take around 3.5 to 4 > > minutes. I pasted one output from the sampling profiler here [1]. If > > I'm reading this correctly, it looks like the program is spending most > > of its time performing element-wise multiplication of the indicator > > vectors I described above. > > > > I would be grateful for any suggestions that would bring the > > performance of the julia program in line with the matlab version. I've > > heard that the last time the matlab code was run on the full data set > > it took a couple days, so a slow-down of 3-4x is a signficant burden. > > I did attempt to write a more idiomatic julia version using Dicts and > > Sets, but it's slower than the version that uses sparse matrix > > operations: > > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl > > > > Thank you! > > Josh > > > > > > [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5 > > > > > >
