Hello! I'm trying to replace an existing matlab code with julia and I'm 
having trouble matching the performance of the original code. The matlab 
code is here:
    https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m

The program clusters inventors from a database of patent applications. The 
input data is a sparse boolean matrix (named XX in the script), where each 
row defines an inventor and each column defines a feature. For example, the 
jth column might correspond to a feature "first name is John". If there is 
a 1 in the XX[i, j], this means that inventor i's first name is John. Given 
an inventor i, we find similar inventors by identifying rows in the matrix 
that agree with XX[i, :] on a given column and then applying element-wise 
boolean operations to the rows. In the code, for a given value of `index`, 
C_lastname holds the unique column in XX corresponding to a "last name" 
feature such that XX[index, :] equals 1. C_firstname holds the unique 
column in XX corresponding to a "first name" feature such that XX[index, :] 
equals 1. And so on. The following code snippet finds all rows in the 
matrix that agree with XX[index, :] on full name and one of patent assignee 
name, inventory city, or patent class:

    lump_index_2 = step & ((C_assignee | C_city | C_class))

The `step` variable is an indicator that's used to prevent the same 
inventors from being considered multiple times. My attempt at a literal 
translation of this code to julia is here:
    https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl

The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean operations 
aren't supported for sparse matrices in julia, so I fake it with integer 
arithmetic.  The line that corresponds to the matlab code above is

    lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + C_class)))

The reason I grouped it this way is that initially `step` will be a 
"sparse" vector of all 1's, and I thought it might help to do the truly 
sparse arithmetic first.

I've been testing this code on a Windows 2008 Server. The test data 
contains 45,763 inventors and 274,578 possible features (in other words, XX 
is an 45,763 x 274,58 sparse matrix). The matlab program consistently takes 
about 70 seconds to run on this data. The julia version shows a lot of 
variation: it's taken as little as 60 seconds and as much as 10 minutes. 
However, most runs take around 3.5 to 4 minutes. I pasted one output from 
the sampling profiler here [1]. If I'm reading this correctly, it looks 
like the program is spending most of its time performing element-wise 
multiplication of the indicator vectors I described above.

I would be grateful for any suggestions that would bring the performance of 
the julia program in line with the matlab version. I've heard that the last 
time the matlab code was run on the full data set it took a couple days, so 
a slow-down of 3-4x is a signficant burden. I did attempt to write a more 
idiomatic julia version using Dicts and Sets, but it's slower than the 
version that uses sparse matrix operations:
    https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl

Thank you!
Josh


[1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5

Reply via email to