Hi, Basil,

I’d say runtime may not be the main concern regarding sparse vs. dense. In my 
opinion, the main reason to use sparse arrays would be memory usage. Text 
data is typically rather large (esp. high-dimensional, sparse feature 
vectors), so one limitation with scikit-learn is typically memory capacity, 
especially if you are using multiprocessing via the n_jobs param (e.g., 
during cross-validation).
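
To put a rough number on the memory argument, here is a minimal sketch. The 
shape and the 1% density are made up for illustration (a random matrix 
standing in for a term-count matrix), so treat the output as an order of 
magnitude rather than a measurement of 20newsgroups:

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 2,000 documents x 5,000 terms, 1% non-zero entries.
X_sparse = sparse.random(2000, 5000, density=0.01, format="csr",
                         dtype=np.float64, random_state=0)

# CSR stores three arrays: the non-zero values, their column indices,
# and one row pointer per row (plus one).
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

# The dense equivalent stores every entry, zeros included.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize

print("sparse: %.1f MB" % (sparse_bytes / 1e6))
print("dense:  %.1f MB" % (dense_bytes / 1e6))
```

With multiprocessing, each worker may end up holding its own copy of the 
data, so the dense figure can be multiplied by the number of jobs.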

PS:

> regular numpy matrix

I think you mean "numpy array"? (There’s a numpy matrix data structure in 
numpy as well; however, almost no one uses it.)
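
For what it’s worth, the distinction shows up when converting a scipy sparse 
matrix. A small sketch with a toy 2x2 matrix (made up for illustration, not 
taken from your code):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))

A = X.toarray()   # plain numpy.ndarray
M = X.todense()   # numpy.matrix

print(type(A).__name__, type(M).__name__)

# The two behave differently: `*` is element-wise for arrays,
# but matrix multiplication for numpy.matrix.
print(A * A)
print(M * M)
```

So if a dense output is intended, toarray() is probably the safer choice.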

Best,
Sebastian

> On Jun 30, 2016, at 6:23 PM, Basil Beirouti <basilbeiro...@gmail.com> wrote:
> 
> Hello everyone, 
> 
> I have successfully created a few versions of the BM25Transformer. I looked 
> at TFIDFTransformer for guidance and I noticed that it outputs a sparse 
> matrix when given a sparse termcount matrix as an input. 
> 
> Unfortunately, the fastest implementation of BM25Transformer that I have been 
> able to come up with does NOT output a sparse matrix, it will return a 
> regular numpy matrix. 
> 
> Benchmarked against the entire 20newsgroups corpus, here is how they perform 
> (assuming input is csr_matrix for all):
> 
> 1.) finishes in 4 seconds, outputs a regular numpy matrix
> 2.) finishes in 30 seconds, outputs a dok_matrix
> 3.) finishes in 130 seconds, outputs a regular numpy matrix
> 
> It's worth noting that using algorithm 1 and converting the output to a 
> sparse matrix still takes less time than 3, and takes about as long as 2. 
> 
> So my question is, how important is it that my BM25Transformer outputs a 
> sparse matrix? 
> 
> I'm going to try another implementation which looks directly at the data, 
> indices, and indptr attributes of the inputted csr_matrix. I just wanted to 
> check in and see what people thought.
> 
> Sincerely,
> Basil Beirouti
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

