Hello! I'm trying to replace an existing matlab code with julia and I'm
having trouble matching the performance of the original code. The matlab
code is here:
https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m
The program clusters inventors from a database of patent applications. The
input data is a sparse boolean matrix (named XX in the script), where each
row defines an inventor and each column defines a feature. For example, the
jth column might correspond to a feature "first name is John". If there is
a 1 in the XX[i, j], this means that inventor i's first name is John. Given
an inventor i, we find similar inventors by identifying rows in the matrix
that agree with XX[i, :] on a given column and then applying element-wise
boolean operations to the rows. In the code, for a given value of `index`,
C_lastname holds the unique column in XX corresponding to a "last name"
feature such that XX[index, :] equals 1. C_firstname holds the unique
column in XX corresponding to a "first name" feature such that XX[index, :]
equals 1. And so on. The following code snippet finds all rows in the
matrix that agree with XX[index, :] on full name and one of patent assignee
name, inventory city, or patent class:
lump_index_2 = step & ((C_assignee | C_city | C_class))
The `step` variable is an indicator that's used to prevent the same
inventors from being considered multiple times. My attempt at a literal
translation of this code to julia is here:
https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl
The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean operations
aren't supported for sparse matrices in julia, so I fake it with integer
arithmetic. The line that corresponds to the matlab code above is
lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + C_class)))
The reason I grouped it this way is that initially `step` will be a
"sparse" vector of all 1's, and I thought it might help to do the truly
sparse arithmetic first.
I've been testing this code on a Windows 2008 Server. The test data
contains 45,763 inventors and 274,578 possible features (in other words, XX
is an 45,763 x 274,58 sparse matrix). The matlab program consistently takes
about 70 seconds to run on this data. The julia version shows a lot of
variation: it's taken as little as 60 seconds and as much as 10 minutes.
However, most runs take around 3.5 to 4 minutes. I pasted one output from
the sampling profiler here [1]. If I'm reading this correctly, it looks
like the program is spending most of its time performing element-wise
multiplication of the indicator vectors I described above.
I would be grateful for any suggestions that would bring the performance of
the julia program in line with the matlab version. I've heard that the last
time the matlab code was run on the full data set it took a couple days, so
a slow-down of 3-4x is a signficant burden. I did attempt to write a more
idiomatic julia version using Dicts and Sets, but it's slower than the
version that uses sparse matrix operations:
https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl
Thank you!
Josh
[1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5