Thanks. The instances of XX[i, :] that appear in my post here are just pseudo-code. In the actual implementation only column-wise slices are used.
On Tuesday, November 11, 2014 3:49:00 AM UTC-5, Mauro wrote: > > I didn't look at your code but it sounds like you are doing row wise > operations. However, the sparse matrices in Julia (and in Matlab too, I > think) are much faster at column-wise access: > > XX[:,i] is fast > XX[i,:] is slow > > If you have to do both, then you can consider doing column-wise first > then transpose and do columns again. > > On Mon, 2014-11-10 at 22:03, Joshua Tokle <[email protected] <javascript:>> > wrote: > > Hello! I'm trying to replace an existing matlab code with julia and I'm > > having trouble matching the performance of the original code. The matlab > > code is here: > > https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m > > > > The program clusters inventors from a database of patent applications. > The > > input data is a sparse boolean matrix (named XX in the script), where > each > > row defines an inventor and each column defines a feature. For example, > the > > jth column might correspond to a feature "first name is John". If there > is > > a 1 in the XX[i, j], this means that inventor i's first name is John. > Given > > an inventor i, we find similar inventors by identifying rows in the > matrix > > that agree with XX[i, :] on a given column and then applying > element-wise > > boolean operations to the rows. In the code, for a given value of > `index`, > > C_lastname holds the unique column in XX corresponding to a "last name" > > feature such that XX[index, :] equals 1. C_firstname holds the unique > > column in XX corresponding to a "first name" feature such that XX[index, > :] > > equals 1. And so on. The following code snippet finds all rows in the > > matrix that agree with XX[index, :] on full name and one of patent > assignee > > name, inventory city, or patent class: > > > > lump_index_2 = step & ((C_assignee | C_city | C_class)) > > > > The `step` variable is an indicator that's used to prevent the same > > inventors from being considered multiple times. My attempt at a literal > > translation of this code to julia is here: > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl > > > > The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean > operations > > aren't supported for sparse matrices in julia, so I fake it with integer > > arithmetic. The line that corresponds to the matlab code above is > > > > lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + > C_class))) > > > > The reason I grouped it this way is that initially `step` will be a > > "sparse" vector of all 1's, and I thought it might help to do the truly > > sparse arithmetic first. > > > > I've been testing this code on a Windows 2008 Server. The test data > > contains 45,763 inventors and 274,578 possible features (in other words, > XX > > is an 45,763 x 274,58 sparse matrix). The matlab program consistently > takes > > about 70 seconds to run on this data. The julia version shows a lot of > > variation: it's taken as little as 60 seconds and as much as 10 minutes. > > However, most runs take around 3.5 to 4 minutes. I pasted one output > from > > the sampling profiler here [1]. If I'm reading this correctly, it looks > > like the program is spending most of its time performing element-wise > > multiplication of the indicator vectors I described above. > > > > I would be grateful for any suggestions that would bring the performance > of > > the julia program in line with the matlab version. I've heard that the > last > > time the matlab code was run on the full data set it took a couple days, > so > > a slow-down of 3-4x is a signficant burden. I did attempt to write a > more > > idiomatic julia version using Dicts and Sets, but it's slower than the > > version that uses sparse matrix operations: > > > https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl > > > > Thank you! > > Josh > > > > > > [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5 > >
