I'm not familiar with the APIs in question, but--if I'm looking at this right, 
your CSR matrix (tfidf) looks like it would have columns corresponding with 
topics and rows corresponding with documents. If that's the case, you could 
maybe do something like this:

1. Use tfidf.getcol() to get the column corresponding to your chosen topic. 
Looks like that should give you a single-column sparse matrix (one row per 
document) of all document scores for that topic.
2. Convert that to a dense array of scores using .toarray(). The result is 
still two-dimensional (n x 1), so flatten it with .ravel() and then turn it 
into a list with .tolist(). (I think?)
3. Use a list comprehension and "enumerate" to generate explicit doc IDs based 
on each document's position in the list, creating a list of 2-element lists or 
tuples, (doc_id, score). While you're at it, you could filter the list 
comprehension to give you only the documents with scores that are greater than 
0, or some other threshold.
4. Pass the results through the built-in "sorted" function to sort your list of 
tuples based on score, with reverse=True so the highest scores come first.

>>> topic = 9497
>>> score_thresh = 0
>>> # .ravel() flattens the (n x 1) column into a flat array of scores
>>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
>>> docs_and_scores = [(doc_id, score) for doc_id, score in
...                    enumerate(topic_scores) if score > score_thresh]
>>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1],
...                             reverse=True)

The resulting "most_relevant_docs" variable should be a list of tuples that 
looks something like this (for example):
[(102, 0.9), (33, 0.875), (365, 0.874), ...]

Not sure if that's helpful...? There's probably a more numpy/scipy way of doing 
the above using actual numpy array methods (especially the 4th line).
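
For what it's worth, a more numpy-flavored version might look something like 
the sketch below. The small tfidf matrix here is made up purely for 
illustration (4 documents by 3 topics), and it assumes rows are documents and 
columns are topics, as above. np.flatnonzero and np.argsort are standard numpy 
functions; the variable names are just placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up stand-in for tfidf: 4 documents (rows) x 3 topics (columns).
tfidf = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.9, 0.0, 0.1],
    [0.0, 0.2, 0.0],
    [0.3, 0.7, 0.0],
]))

topic = 1
score_thresh = 0

# Flatten the (n_docs x 1) column into a 1-D array of scores.
scores = tfidf.getcol(topic).toarray().ravel()

# Row indices (doc IDs) of the documents scoring above the threshold.
doc_ids = np.flatnonzero(scores > score_thresh)

# Sort those documents by score, highest first.
order = np.argsort(scores[doc_ids])[::-1]
most_relevant_docs = [(int(d), float(scores[d])) for d in doc_ids[order]]

print(most_relevant_docs)
```

With this toy matrix, document 3 comes out first (score 0.7), followed by 
documents 0 and 2. Filtering before sorting means only the nonzero scores get 
sorted, which should matter when most documents score 0 for a given topic.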

Jason


-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Eric 
Lease Morgan
Sent: Tuesday, September 26, 2017 12:33 PM
To: [email protected]
Subject: [EXT] Re: [CODE4LIB] accessing a python compressed sparse row format 
object

On Sep 26, 2017, at 1:28 PM, Andromeda Yelton <[email protected]> 
wrote:

>> Does anybody here know how to access a Python compressed sparse row format
>> (CSR) object? [1]
>> 
>> [1] CSR - http://bit.ly/2fPj42V
> 
> Do you have a link to the code you're using?


Yes, thank you. See —> 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  —ELM
