Re: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)

Joel Nothman Thu, 30 Jun 2016 15:41:41 -0700

I don't see what about BM25, at least as presented at
https://en.wikipedia.org/wiki/Okapi_BM25, should prevent using CSR
operations efficiently. Show us your code.
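
For concreteness, here is a minimal sketch of how BM25 weighting could stay in CSR throughout. It is not the code under discussion: the function name `bm25_transform`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` are all assumptions chosen for illustration. The point is only that every step touches the stored nonzeros, so nothing needs to be densified.

```python
import numpy as np
import scipy.sparse as sp


def bm25_transform(X, k1=1.2, b=0.75):
    """Hypothetical helper: BM25-weight a term-count matrix, staying in CSR.

    X is (n_docs, n_terms) raw term counts. k1, b and the IDF variant
    below are illustrative assumptions, not taken from the thread.
    """
    # Work on a float copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.sum_duplicates()
    X.eliminate_zeros()  # so stored entries == true nonzeros
    n_docs, n_terms = X.shape

    # Document lengths |D| and their mean (avgdl).
    doc_len = np.asarray(X.sum(axis=1)).ravel()
    avgdl = doc_len.mean()

    # Document frequency n(q): count stored entries per column.
    df = np.bincount(X.indices, minlength=n_terms)
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

    # Per-row length normalisation, repeated once per stored entry of
    # that row (np.diff(indptr) gives the number of entries per row).
    row_norm = k1 * (1.0 - b + b * doc_len / avgdl)
    norm = np.repeat(row_norm, np.diff(X.indptr))

    # Rewrite only the stored values; indices and indptr are unchanged,
    # so the output has exactly the sparsity pattern of the input.
    X.data = idf[X.indices] * X.data * (k1 + 1.0) / (X.data + norm)
    return X
```

Assuming the input comes from something like CountVectorizer, `bm25_transform(counts)` returns a csr_matrix with the same nonzero pattern as `counts`, because a zero term count always maps to a zero BM25 weight.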


On 1 July 2016 at 08:23, Basil Beirouti <[email protected]> wrote:

> Hello everyone,
>
> I have successfully created a few versions of the BM25Transformer. I
> looked at TfidfTransformer for guidance and noticed that it outputs a
> sparse matrix when given a sparse term-count matrix as input.
>
> Unfortunately, the fastest implementation of BM25Transformer that I have
> been able to come up with does NOT output a sparse matrix; it returns a
> regular (dense) numpy matrix.
>
> Benchmarked against the entire 20newsgroups corpus, here is how they
> perform (assuming input is csr_matrix for all):
>
> 1.) finishes in 4 seconds, outputs a regular numpy matrix
> 2.) finishes in 30 seconds, outputs a dok_matrix
> 3.) finishes in 130 seconds, outputs a regular numpy matrix
>
> It's worth noting that using algorithm 1 and converting the output to a
> sparse matrix still takes less time than 3, and takes about as long as 2.
>
> So my question is, how important is it that my BM25Transformer outputs a
> sparse matrix?
>
> I'm going to try another implementation which works directly on the data,
> indices, and indptr attributes of the input csr_matrix. I just wanted to
> check in and see what people thought.
>
> Sincerely,
> Basil Beirouti
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

