I don't see anything in BM25, at least as presented at https://en.wikipedia.org/wiki/Okapi_BM25, that should prevent it from being computed efficiently with CSR operations. Show us your code.
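
To illustrate the point, here is a minimal sketch of how the BM25 weighting on that Wikipedia page might be applied directly to the data, indices, and indptr arrays of a CSR term-count matrix, so that the output stays sparse. The function name bm25_transform, the defaults k1=1.2 and b=0.75, and the "+1" smoothed IDF variant are assumptions made for this example; this is not Basil's code and not an existing scikit-learn API.

```python
import numpy as np
import scipy.sparse as sp


def bm25_transform(X, k1=1.2, b=0.75):
    """Hypothetical helper: BM25-weight a (n_docs, n_terms) count matrix."""
    # Work on a float copy so the caller's matrix is not modified.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.sum_duplicates()
    X.eliminate_zeros()

    n_docs, n_terms = X.shape
    # Document length = sum of term counts in each row.
    doc_len = np.asarray(X.sum(axis=1)).ravel()
    avgdl = doc_len.mean()

    # Document frequency = number of stored entries per column.
    df = np.bincount(X.indices, minlength=n_terms)
    # Smoothed IDF (assumed variant; the +1 keeps every weight positive).
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

    # Row index of every stored entry, recovered from indptr.
    rows = np.repeat(np.arange(n_docs), np.diff(X.indptr))

    tf = X.data
    denom = tf + k1 * (1.0 - b + b * doc_len[rows] / avgdl)
    data = idf[X.indices] * tf * (k1 + 1.0) / denom

    # Same sparsity pattern as the input, new values.
    return sp.csr_matrix((data, X.indices, X.indptr), shape=X.shape)


counts = sp.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [1, 1, 0]]))
weighted = bm25_transform(counts)
print(weighted.format, weighted.nnz)
```

Everything here is a vectorised operation over the stored entries only, so the cost scales with the number of nonzeros rather than with n_docs * n_terms, and no dense intermediate is created.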

On 1 July 2016 at 08:23, Basil Beirouti <[email protected]> wrote:
> Hello everyone,
>
> I have successfully created a few versions of the BM25Transformer. I
> looked at TFIDFTransformer for guidance and I noticed that it outputs a
> sparse matrix when given a sparse termcount matrix as an input.
>
> Unfortunately, the fastest implementation of BM25Transformer that I have
> been able to come up with does NOT output a sparse matrix, it will return a
> regular numpy matrix.
>
> Benchmarked against the entire 20newsgroups corpus, here is how they
> perform (assuming input is csr_matrix for all):
>
> 1.) finishes in 4 seconds, outputs a regular numpy matrix
> 2.) finishes in 30 seconds, outputs a dok_matrix
> 3.) finishes in 130 seconds, outputs a regular numpy matrix
>
> It's worth noting that using algorithm 1 and converting the output to a
> sparse matrix still takes less time than 3, and takes about as long as 2.
>
> So my question is, how important is it that my BM25Transformer outputs a
> sparse matrix?
>
> I'm going to try another implementation which looks directly at the data,
> indices, and indptr attributes of the inputted csr_matrix. I just wanted to
> check in and see what people thought.
>
> Sincerely,
> Basil Beirouti
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
