Re: [scikit-learn] Bm25

Vlad Niculae Fri, 01 Jul 2016 15:39:07 -0700

In the denominator you mean? It looks like you only need to add that to nonzero 
elements, since the others would all have a 0 in the numerator, right? So the 
final value would be zero there. Or am I missing something?
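
To illustrate the point about nonzero elements: wherever f(w,D) is zero the numerator is zero, so only the stored entries of the matrix need the denominator. A minimal sketch, assuming X is a CSR matrix with one row per document and B is a length-n_samples array; the helper name bm25_on_data and the default k=1.2 are made up for illustration and are not part of scikit-learn:

```python
import numpy as np
import scipy.sparse as sp


def bm25_on_data(X, B, k=1.2):
    """Apply f * (k + 1) / (k * B + f) to the stored entries of X only."""
    X = sp.csr_matrix(X, dtype=np.float64)
    # Number of stored entries in each row (document).
    nnz_per_row = np.diff(X.indptr)
    # Repeat B[i] once per stored entry of row i so it lines up with X.data.
    B_per_entry = np.repeat(np.asarray(B, dtype=np.float64), nnz_per_row)
    f = X.data
    new_data = f * (k + 1) / (k * B_per_entry + f)
    # Reuse the sparsity pattern of X; implicit zeros stay zero.
    return sp.csr_matrix((new_data, X.indices, X.indptr), shape=X.shape)
```

This never touches the implicit zeros, so no scalar is added to the whole matrix. It assumes k * B + f is nonzero at every stored entry.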


You can initialize a sparse matrix with the same sparsity pattern as X, but 
with its data set to k everywhere. Then use inplace_row_scale to multiply it 
by B, and add the result to X to get the denominator.
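
The steps above can be sketched roughly as follows. This assumes X is a CSR matrix of term counts with documents as rows and B is a length-n_samples array with k * B > 0; the function name bm25_saturate and the default k=1.2 are placeholders for illustration, not an existing scikit-learn API:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale


def bm25_saturate(X, B, k=1.2):
    """Compute f * (k + 1) / (k * B + f) on the stored entries of X."""
    X = sp.csr_matrix(X, dtype=np.float64)
    # Same sparsity pattern as X, but every stored value is k.
    denom = X.copy()
    denom.data.fill(k)
    # Scale row i by B[i]: stored entries now hold k * B[i].
    inplace_row_scale(denom, np.asarray(B, dtype=np.float64))
    # denom and X share the same indices/indptr layout, so adding the
    # data arrays is the same as adding the two matrices.
    denom.data += X.data
    out = X.copy()
    out.data = X.data * (k + 1) / denom.data
    return out
```

The result keeps the sparsity pattern of X, so entries where f(w,D) is zero are never visited.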

On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <[email protected]> wrote:
>Hi Vlad,
>
>Thanks for the quick reply. Unfortunately there's still the question of
>adding a scalar to every element in sparse matrix, which is not allowed
>for sparse matrices, and which is not possible to avoid in the
>equation.
>
>Sincerely,
>Basil Beirouti 
>
>
>> On Jul 1, 2016, at 4:36 PM, [email protected] wrote:
>> 
>> Today's Topics:
>> 
>>   1. Adding BM25 to scikit-learn.feature_extraction.text
>>      (Basil Beirouti)
>>   2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>>      (Vlad Niculae)
>> 
>> 
>> ----------------------------------------------------------------------
>> 
>> Message: 1
>> Date: Fri, 1 Jul 2016 16:17:43 -0500
>> From: Basil Beirouti <[email protected]>
>> To: [email protected]
>> Subject: [scikit-learn] Adding BM25 to
>>    scikit-learn.feature_extraction.text
>> Message-ID:
>>   
><cab4mtg8805nndaja5cscf+phrjyq0btc-agzegd8cqb95sv...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi everyone,
>> 
>> to put it succinctly, here's the BM25 equation:
>> 
>> f(w,D) * (k+1) / (k*B + f(w,D))
>> 
>> where w is the word and D is the document (corresponding to columns and
>> rows, respectively). f is a sparse matrix because only a fraction of the
>> whole vocabulary of words appears in any given single document.
>> 
>> B is a function of only the document, but it doesn't matter, you can
>> think of it as a constant if you want.
>> 
>> The problem is that, since f(w,D) is almost always zero, I only need to
>> do the calculation (i.e. multiply by (k+1), then divide by
>> (k*B + f(w,D))) when f(w,D) is not zero. Is there a clever way to do
>> this with masks?
>> 
>> You can refactor the above equation to get this:
>> 
>> (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>> denominator, which is bad (because of dividing by zero).
>> 
>> So anyway, currently I am converting to a coo_matrix and iterating
>> through the non-zero values like this:
>> 
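>>    cx = x.tocoo()  # COO exposes parallel row/col/data arrays
>>    # zip replaces Python 2's itertools.izip (lazy in Python 3)
>>    for i, j, v in zip(cx.row, cx.col, cx.data):
>>        print(i, j, v)  # each stored (row, col, value) triple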
>> 
>> 
>> That iterator is incredibly fast, but unfortunately coo_matrix does
>> not support assignment. So I create a new copy of either a dok sparse
>> matrix or a regular numpy array and assign to that.
>> 
>> I could also deal directly with the .data, .indptr, and .indices
>> attributes of csr_matrix, and see if it's possible to create a copy of
>> the .data attribute and update the values accordingly. I was hoping
>> somebody had encountered this type of issue before.
>> 
>> Sincerely,
>> 
>> Basil Beirouti
>> 
>> ------------------------------
>> 
>> Message: 2
>> Date: Fri, 01 Jul 2016 17:35:49 -0400
>> From: Vlad Niculae <[email protected]>
>> To: Scikit-learn user and developer mailing list
>>    <[email protected]>
>> Subject: Re: [scikit-learn] Adding BM25 to
>>    scikit-learn.feature_extraction.text
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi Basil,
>> 
>> If B were just a constant, you could do the whole thing as a vectorized
>> operation on X.data.
>> 
>> Since I understand B is an n_samples vector, I think the cleanest way to
>> compute the denominator is using
>> sklearn.utils.sparsefuncs.inplace_row_scale.
>> 
>> Hope this helps,
>> 
>> Vlad
>> 
>> 
>> 
>> -- 
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>> 
>> ------------------------------
>> 
>> Subject: Digest Footer
>> 
>> _______________________________________________
>> scikit-learn mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> 
>> ------------------------------
>> 
>> End of scikit-learn Digest, Vol 4, Issue 3
>> ******************************************

