Re: [scikit-learn] Bm25

Basil Beirouti Fri, 01 Jul 2016 15:50:23 -0700

Oh yes, that's exactly what I was looking for. So how do I initialize an array 
with the same sparsity pattern as X? And then how do I do an element-wise 
divide of the numerator by the denominator when both are sparse matrices? 
As you said, it should only do this operation on the nonzero elements of the 
numerator.
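
One way the steps from the quoted reply below could fit together, as a minimal sketch and not a tested scikit-learn feature. It assumes X is a CSR matrix of counts with one row per document and B holds one value per row; the function name bm25_weights is made up for illustration. Because the denominator is built as a copy of X, both matrices share the same sparsity pattern, so the element-wise divide reduces to dividing their .data arrays.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale


def bm25_weights(X, B, k):
    """Apply f * (k + 1) / (k * B + f) to the stored entries of X only.

    Assumptions (not from the thread): rows of X are documents,
    B has one entry per row, and k is a plain float.
    """
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    B = np.asarray(B, dtype=np.float64)

    # Same sparsity pattern as X, but the data is k everywhere.
    denom = X.copy()
    denom.data[:] = k

    # Multiply row i of denom by B[i], giving k * B on the nonzeros.
    inplace_row_scale(denom, B)

    # denom and X share indices/indptr, so their data arrays line up
    # entry for entry and can be combined directly.
    denom.data += X.data
    X.data = X.data * (k + 1) / denom.data
    return X
```

Entries that are not stored in X are never touched, so they stay zero and no division by zero can occur as long as k * B + f is nonzero on the stored entries.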


Sent from my iPhone

> On Jul 1, 2016, at 5:36 PM, Vlad Niculae <[email protected]> wrote:
> 
> In the denominator you mean? It looks like you only need to add that to 
> nonzero elements, since the others would all have a 0 in the numerator, 
> right? So the final value would be zero there. Or am I missing something?
> 
> You can initialize an array with the same sparsity pattern as X, but its data 
> is k everywhere. Then use inplace_row_scale to multiply it by B, then add 
> this to X to get the denominator.
> 
>> On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <[email protected]> 
>> wrote:
>> Hi Vlad,
>> 
>> Thanks for the quick reply. Unfortunately there's still the question of 
>> adding a scalar to every element of a sparse matrix, which sparse matrices 
>> do not allow, and which cannot be avoided in the equation.
>> 
>> Sincerely,
>> Basil Beirouti 
>> 
>> 
>>>  On Jul 1, 2016, at 4:36 PM, [email protected] wrote:
>>>  
>>>  Today's Topics:
>>>  
>>>    1. Adding BM25 to scikit-learn.feature_extraction.text
>>>       (Basil Beirouti)
>>>    2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>>>       (Vlad Niculae)
>>>  
>>>  
>>> 
>>>  
>>>  Message: 1
>>>  Date: Fri, 1 Jul 2016 16:17:43 -0500
>>>  From: Basil Beirouti <[email protected]>
>>>  To: [email protected]
>>>  Subject: [scikit-learn] Adding BM25 to
>>>     scikit-learn.feature_extraction.text
>>>  Message-ID:
>>>     <cab4mtg8805nndaja5cscf+phrjyq0btc-agzegd8cqb95sv...@mail.gmail.com>
>>>  Content-Type: text/plain; charset="utf-8"
>>>  
>>>  Hi everyone,
>>>  
>>>  to put it succinctly, here's the BM25 equation:
>>>  
>>>  f(w,D) * (k+1) / (k*B + f(w,D))
>>>  
>>>  where w is the word, and D is the document (corresponding to rows and
>>>  columns, respectively). f is a sparse matrix because only a fraction of the
>>>  whole vocabulary of words appears in any given single document.
>>>  
>>>  B is a function of only the document, but it doesn't matter, you can think
>>>  of it as a constant if you want.
>>>  
>>>  The problem is that since f(w,D) is almost always zero, I only need to do the
>>>  calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) when
>>>  f(w,D) is not zero. Is there a clever way to do this with masks?
>>>  
>>>  You can refactor the above equation to get this:
>>>  
>>>  (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>>>  denominator, which is bad (because of dividing by zero).
>>>  
>>>  So anyway, currently I am converting to a coo_matrix and iterating through
>>>  the non-zero values like this:
>>>  
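>>>     cx = x.tocoo()
>>>     # Built-in zip is lazy on Python 3 (itertools.izip is Python 2 only).
>>>     for i, j, v in zip(cx.row, cx.col, cx.data):
>>>         (i, j, v)  # row index, column index, stored value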
>>>  
>>>  
>>>  That iterator is incredibly fast, but unfortunately coo_matrix does
>>>  not support assignment. So I create a new copy of either a dok sparse
>>>  matrix or a regular numpy array and assign to that.
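
The COO route described above can avoid both the Python loop and item assignment by computing a new data array in one vectorized step and building a fresh matrix from it. This is a rough sketch under assumptions the thread does not state: rows are documents, B has one value per row, the input has no duplicate (row, col) entries, and the name bm25_via_coo is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def bm25_via_coo(X, B, k):
    """Rebuild the matrix from transformed COO data instead of assigning."""
    cx = sp.coo_matrix(X, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)

    f = cx.data
    # cx.row holds the row index of every stored value, so B[cx.row]
    # picks the matching per-document value for each nonzero.
    new_data = f * (k + 1) / (k * B[cx.row] + f)

    # A new matrix with the same coordinates and the transformed values.
    out = sp.coo_matrix((new_data, (cx.row, cx.col)), shape=cx.shape)
    return out.tocsr()
```

No dok matrix or dense array is needed as an intermediate, since the coordinates are reused unchanged.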
>>>  
>>>  I could also deal directly with the .data, .indptr, and .indices
>>>  attributes of csr_matrix, and see if it's possible to create a copy of
>>>  .data attribute and update the values accordingly. I was hoping
>>>  somebody had encountered this type of issue before.
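
Working on the CSR attributes directly could look roughly like the sketch below. In a CSR matrix, the difference between consecutive indptr entries is the number of stored values in each row, which is enough to expand B to one value per stored entry. The assumptions here are that rows are documents and B has one value per row; the function name bm25_from_csr_data is invented for this example.

```python
import numpy as np
import scipy.sparse as sp


def bm25_from_csr_data(X, B, k):
    """Update a copy of X.data using only indptr to locate the rows."""
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    B = np.asarray(B, dtype=np.float64)

    # Number of stored entries in each row.
    nnz_per_row = np.diff(X.indptr)

    # One B value per stored entry, in the same order as X.data.
    B_per_entry = np.repeat(B, nnz_per_row)

    f = X.data
    X.data = f * (k + 1) / (k * B_per_entry + f)
    return X
```

The indices array is never read, because the formula depends only on the value and the row it belongs to.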
>>>  
>>>  Sincerely,
>>>  
>>>  Basil Beirouti
>>> 
>>>  
>>>  Message: 2
>>>  Date: Fri, 01 Jul 2016 17:35:49 -0400
>>>  From: Vlad Niculae <[email protected]>
>>>  To: Scikit-learn user and developer mailing list
>>>     <[email protected]>
>>>  Subject: Re: [scikit-learn] Adding BM25 to
>>>     scikit-learn.feature_extraction.text
>>>  Message-ID: <[email protected]>
>>>  Content-Type: text/plain; charset="utf-8"
>>>  
>>>  Hi Basil,
>>>  
>>>  If B were just a constant, you could do the whole thing as a vectorized
>>>  operation on X.data.
>>>  
>>>  Since I understand B is an n_samples vector, I think the cleanest way to
>>>  compute the denominator is using
>>>  sklearn.utils.sparsefuncs.inplace_row_scale.
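
For the constant-B case mentioned above, the whole formula touches only X.data. A small sketch, assuming b is a scalar and X is any SciPy sparse matrix of counts; the name bm25_constant_b is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def bm25_constant_b(X, b, k):
    """f * (k + 1) / (k * b + f) on the stored values, with scalar b."""
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    # X.data holds only the stored (nonzero) counts, so the implicit
    # zeros are skipped and stay zero in the result.
    X.data = X.data * (k + 1) / (k * b + X.data)
    return X
```

When B varies per row, the scalar b has to become one value per stored entry, which is the part inplace_row_scale handles.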
>>>  
>>>  Hope this helps,
>>>  
>>>  Vlad
>>>  
>>>  
>>>  
> 
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
