Hello everyone,

I have successfully created a few versions of the BM25Transformer. I looked at TfidfTransformer for guidance and noticed that it outputs a sparse matrix when given a sparse term-count matrix as input.

Unfortunately, the fastest implementation of BM25Transformer that I have been able to come up with does NOT output a sparse matrix; it returns a regular numpy matrix. Benchmarked against the entire 20newsgroups corpus, here is how the three versions perform (the input is a csr_matrix in every case):

1.) finishes in 4 seconds, outputs a regular numpy matrix
2.) finishes in 30 seconds, outputs a dok_matrix
3.) finishes in 130 seconds, outputs a regular numpy matrix

It's worth noting that using algorithm 1 and then converting its output to a sparse matrix still takes less time than algorithm 3, and about as long as algorithm 2.

So my question is: how important is it that my BM25Transformer outputs a sparse matrix? I'm going to try another implementation that works directly on the data, indices, and indptr attributes of the input csr_matrix. I just wanted to check in and see what people thought.

Sincerely,
Basil Beirouti
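
P.S. A rough sketch of what the data/indices/indptr idea might look like. It is not one of the three versions benchmarked above and has not been timed on 20newsgroups. It assumes the common Okapi BM25 weighting with a smoothed IDF, log(1 + (N - df + 0.5) / (df + 0.5)); the function name bm25_transform and the defaults k1=1.2 and b=0.75 are placeholders chosen for illustration.

```python
import numpy as np
import scipy.sparse as sp


def bm25_transform(X, k1=1.2, b=0.75):
    """Hypothetical helper: BM25-weight a term-count matrix and keep it sparse.

    Assumes rows are documents, columns are terms, and entries are
    non-negative counts. k1 and b are the usual BM25 free parameters.
    """
    # Work on a float CSR copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.sum_duplicates()
    X.eliminate_zeros()
    n_docs, n_terms = X.shape

    # Document lengths (row sums) and the average document length.
    doc_len = np.asarray(X.sum(axis=1)).ravel()
    avg_len = doc_len.mean()

    # Document frequency: number of stored (non-zero) entries per column.
    df = np.bincount(X.indices, minlength=n_terms)
    # Smoothed IDF variant (an assumption; it keeps every weight positive).
    idf = np.log1p((n_docs - df + 0.5) / (df + 0.5))

    # indptr gives the number of stored entries in each row, which lets us
    # expand the per-document length normalisation to one value per entry.
    row_nnz = np.diff(X.indptr)
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    norm_per_entry = np.repeat(norm, row_nnz)

    # Only the stored entries are rewritten, so the sparsity pattern
    # (indices and indptr) is reused as-is.
    tf = X.data
    X.data = idf[X.indices] * tf * (k1 + 1.0) / (tf + norm_per_entry)
    return X


if __name__ == "__main__":
    counts = sp.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [1, 1, 0]]))
    weighted = bm25_transform(counts)
    print(type(weighted), weighted.nnz)
```

Since a zero count always maps to a zero weight under this formula, the output has the same sparsity pattern as the input, so nothing dense ever needs to be allocated.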
