Le lundi 10 novembre 2014 à 13:03 -0800, Joshua Tokle a écrit :
> Hello! I'm trying to replace an existing matlab code with julia and
> I'm having trouble matching the performance of the original code. The
> matlab code is here:
> 
> https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.m
> 
> The program clusters inventors from a database of patent applications.
> The input data is a sparse boolean matrix (named XX in the script),
> where each row defines an inventor and each column defines a feature.
> For example, the jth column might correspond to a feature "first name
> is John". If there is a 1 in the XX[i, j], this means that inventor
> i's first name is John. Given an inventor i, we find similar inventors
> by identifying rows in the matrix that agree with XX[i, :] on a given
> column and then applying element-wise boolean operations to the rows.
> In the code, for a given value of `index`, C_lastname holds the unique
> column in XX corresponding to a "last name" feature such that
> XX[index, :] equals 1. C_firstname holds the unique column in XX
> corresponding to a "first name" feature such that XX[index, :] equals
> 1. And so on. The following code snippet finds all rows in the matrix
> that agree with XX[index, :] on full name and one of patent assignee
> name, inventory city, or patent class:
> 
>     lump_index_2 = step & ((C_assignee | C_city | C_class))
> 
> The `step` variable is an indicator that's used to prevent the same
> inventors from being considered multiple times. My attempt at a
> literal translation of this code to julia is here:
> 
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jl
> 
> The matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean
> operations aren't supported for sparse matrices in julia, so I fake it
> with integer arithmetic.  The line that corresponds to the matlab code
> above is
> 
>     lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + C_class)))
You should be able to get a speedup by replacing this line with an
explicit `for` loop. First, you'll avoid memory allocation (one for each
+ or .* operation). Second, you'll be able to return as soon as the
index is found, instead of computing the value for all elements (IIUC
you're only looking for one index, right?).


My two cents

> The reason I grouped it this way is that initially `step` will be a
> "sparse" vector of all 1's, and I thought it might help to do the
> truly sparse arithmetic first.
> 
> I've been testing this code on a Windows 2008 Server. The test data
> contains 45,763 inventors and 274,578 possible features (in other
> words, XX is an 45,763 x 274,58 sparse matrix). The matlab program
> consistently takes about 70 seconds to run on this data. The julia
> version shows a lot of variation: it's taken as little as 60 seconds
> and as much as 10 minutes. However, most runs take around 3.5 to 4
> minutes. I pasted one output from the sampling profiler here [1]. If
> I'm reading this correctly, it looks like the program is spending most
> of its time performing element-wise multiplication of the indicator
> vectors I described above.
> 
> I would be grateful for any suggestions that would bring the
> performance of the julia program in line with the matlab version. I've
> heard that the last time the matlab code was run on the full data set
> it took a couple days, so a slow-down of 3-4x is a signficant burden.
> I did attempt to write a more idiomatic julia version using Dicts and
> Sets, but it's slower than the version that uses sparse matrix
> operations:
> 
> https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jl
> 
> Thank you!
> Josh
> 
> 
> [1] https://gist.github.com/jotok/6b469a1dc0ff9529caf5
> 
> 

Reply via email to