Hi Vlad,

Thanks for the quick reply. Unfortunately, that still leaves the problem of adding a scalar (the k*B term in the denominator) to every element of a sparse matrix. That operation is not allowed for sparse matrices, and I don't see a way to avoid it in the equation.
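
That said, the numerator f(w,D) * (k+1) is zero wherever f(w,D) is zero, so it may be enough to apply the formula to the stored entries only. Below is a rough sketch of that idea. The function name bm25_transform, the default k=1.2, and the layout (a CSR matrix with one document per row, and B as a vector with one value per row) are assumptions made for illustration and are not part of scikit-learn.

```python
import numpy as np
import scipy.sparse as sp


def bm25_transform(X, B, k=1.2):
    """Apply f * (k+1) / (k*B + f) to the stored entries of X only.

    Assumptions (illustrative, not from scikit-learn):
    X is a sparse count matrix with one document per row,
    B is a 1-D array holding one value per row of X.
    """
    # Work on a float CSR copy so the input matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)

    # Number of stored entries in each row.
    row_nnz = np.diff(X.indptr)

    # Repeat each row's B once per stored entry so it lines up with X.data.
    B_per_entry = np.repeat(np.asarray(B, dtype=np.float64), row_nnz)

    # The scalar k*B is added only to stored entries, never to implicit zeros.
    f = X.data
    X.data = f * (k + 1.0) / (k * B_per_entry + f)
    return X
```

If documents are stored as columns instead, the same approach should work on a CSC matrix, whose indptr counts the stored entries per column.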

Sincerely,
Basil Beirouti

> On Jul 1, 2016, at 4:36 PM, [email protected] wrote:
>
> Today's Topics:
>
>    1. Adding BM25 to scikit-learn.feature_extraction.text
>       (Basil Beirouti)
>    2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>       (Vlad Niculae)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 1 Jul 2016 16:17:43 -0500
> From: Basil Beirouti <[email protected]>
> To: [email protected]
> Subject: [scikit-learn] Adding BM25 to
>     scikit-learn.feature_extraction.text
>
> Hi everyone,
>
> to put it succinctly, here's the BM25 equation:
>
>     f(w,D) * (k+1) / (k*B + f(w,D))
>
> where w is the word, and D is the document (corresponding to rows and
> columns, respectively). f is a sparse matrix because only a fraction of the
> whole vocabulary of words appears in any given single document.
>
> B is a function of only the document, but it doesn't matter, you can think
> of it as a constant if you want.
>
> The problem is since f(w,D) is almost always zero, I only need to do the
> calculation (ie. multiply by (k+1) then divide by (k*B + f(w,D))) when
> f(w,D) is not zero. Is there a clever way to do this with masks?
>
> You can refactor the above equation to get this:
>
>     (k+1)/(k*B/f(w,D) + 1)
>
> but alas we still have f(w,D) appearing in a
> denominator, which is bad (because of dividing by zero).
>
> So anyway, currently I am converting to a coo_matrix and iterator through
> the non-zero values like this:
>
>     cx = x.tocoo()
>     for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
>         (i,j,v)
>
> That iterator is incredibly fast, but unfortunately coo_matrix does
> not support assignment. So I create a new copy of either a dok sparse
> matrix or a regular numpy array and assign to that.
>
> I could also deal directly with the .data, .indptr, and indices
> attributes of csr_matrix, and see if it's possible to create a copy of
> .data attribute and update the values accordingly. I was hoping
> somebody had encountered this type of issue before.
>
> Sincerely,
>
> Basil Beirouti
>
> ------------------------------
>
> Message: 2
> Date: Fri, 01 Jul 2016 17:35:49 -0400
> From: Vlad Niculae <[email protected]>
> To: Scikit-learn user and developer mailing list
>     <[email protected]>
> Subject: Re: [scikit-learn] Adding BM25 to
>     scikit-learn.feature_extraction.text
>
> Hi Basil,
>
> If B were just a constant, you could do the whole thing as a vectorized
> operation on X.data.
>
> Since I understand B is a n_samples vector, I think the cleanest way to
> compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale.
>
> Hope this helps,
>
> Vlad
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 4, Issue 3
> ******************************************

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
