I didn't look at your code but it sounds like you are doing row wise
operations.  However, the sparse matrices in Julia (and in Matlab too, I
think) are much faster at column-wise access:

XX[:,i] is fast
XX[i,:] is slow

If you have to do both, then you can consider doing column-wise first
then transpose and do columns again.

On Mon, 2014-11-10 at 22:03, Joshua Tokle <[email protected]> wrote:
> Hello! I'm trying to replace an existing matlab code with julia and I'm 
> having trouble matching the performance of the original code. The matlab 
> code is here:
>     https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m
>
> The program clusters inventors from a database of patent applications. The 
> input data is a sparse boolean matrix (named XX in the script), where each 
> row defines an inventor and each column defines a feature. For example, the 
> jth column might correspond to a feature "first name is John". If there is 
> a 1 in the XX[i, j], this means that inventor i's first name is John. Given 
> an inventor i, we find similar inventors by identifying rows in the matrix 
> that agree with XX[i, :] on a given column and then applying element-wise 
> boolean operations to the rows. In the code, for a given value of `index`, 
> C_lastname holds the unique column in XX corresponding to a "last name" 
> feature such that XX[index, :] equals 1. C_firstname holds the unique 
> column in XX corresponding to a "first name" feature such that XX[index, :] 
> equals 1. And so on. The following code snippet finds all rows in the 
> matrix that agree with XX[index, :] on full name and one of patent assignee 
> name, inventory city, or patent class:
>
>     lump_index_2 = step & ((C_assignee | C_city | C_class))
>
> The `step` variable is an indicator that's used to prevent the same 
> inventors from being considered multiple times. My attempt at a literal 
> translation of this code to julia is here:
>     https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl
>
> The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean operations 
> aren't supported for sparse matrices in julia, so I fake it with integer 
> arithmetic.  The line that corresponds to the matlab code above is
>
>     lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + C_class)))
>
> The reason I grouped it this way is that initially `step` will be a 
> "sparse" vector of all 1's, and I thought it might help to do the truly 
> sparse arithmetic first.
>
> I've been testing this code on a Windows 2008 Server. The test data 
> contains 45,763 inventors and 274,578 possible features (in other words, XX 
> is an 45,763 x 274,58 sparse matrix). The matlab program consistently takes 
> about 70 seconds to run on this data. The julia version shows a lot of 
> variation: it's taken as little as 60 seconds and as much as 10 minutes. 
> However, most runs take around 3.5 to 4 minutes. I pasted one output from 
> the sampling profiler here [1]. If I'm reading this correctly, it looks 
> like the program is spending most of its time performing element-wise 
> multiplication of the indicator vectors I described above.
>
> I would be grateful for any suggestions that would bring the performance of 
> the julia program in line with the matlab version. I've heard that the last 
> time the matlab code was run on the full data set it took a couple days, so 
> a slow-down of 3-4x is a signficant burden. I did attempt to write a more 
> idiomatic julia version using Dicts and Sets, but it's slower than the 
> version that uses sparse matrix operations:
>     https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl
>
> Thank you!
> Josh
>
>
> [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5

Reply via email to