Re: [CODE4LIB] accessing a python compressed sparse row format object [topic modeling; resolved]

2017-09-28 Thread Eric Lease Morgan
On Sep 27, 2017, at 8:18 AM, Eric Lease Morgan  wrote:

>>>>> Does anybody here know how to access a Python compressed sparse row 
>>>>> format (CSR) object? -> http://bit.ly/2fPj42V
>>>> 
>>>> Do you have a link to the code you're using?
>>> 
>>> Yes, thank you. See —> 
>>> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  —ELM
>> 
>> I'm not familiar with the APIs in question, but--if I'm looking at this 
>> right, your CSR matrix (tfidf) looks like it would have columns 
>> corresponding with topics and rows corresponding with documents...
>> 
> Jason, this is REALLY close, and I have begun to include it at the very end 
> of my code. Thank you! ‘More later.


Jason’s suggestions were very helpful, and after hacking on my topic modeling 
program I am giving the program a label of version 1.0. See -> 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py

To resolve my issues I had to:

  1. learn that my vectorizer (TfidfVectorizer) could take
     a list of file names as input, thus the file names
     come along for the ride in the resulting matrices

  2. exploit Jason’s suggestion to extract the file names
     from a list of ranked (sorted) topics
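
The first item above can be sketched as follows. This is a minimal illustration, not the code in topic-model.py: the three tiny files and their contents are made up, and it assumes scikit-learn is installed. Passing input="filename" tells TfidfVectorizer to read each path itself, so row i of the resulting matrix describes the i-th file name in the list.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, written to a temporary directory for the demonstration.
texts = {
    "a.txt": "god church christian church",
    "b.txt": "russian siberia river",
    "c.txt": "social culture church group",
}

with tempfile.TemporaryDirectory() as directory:
    filenames = []
    for name in sorted(texts):
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(texts[name])
        filenames.append(path)

    # input="filename" makes the vectorizer open and read each path itself.
    vectorizer = TfidfVectorizer(input="filename")
    tfidf = vectorizer.fit_transform(filenames)

# Rows keep the order of the file list, so a row index maps back to a name.
column = vectorizer.vocabulary_["church"]
scores = tfidf[:, column].toarray().ravel()
church_by_file = {
    os.path.basename(name): float(score)
    for name, score in zip(filenames, scores)
}
print(church_by_file)
```

Because the rows follow the order of the file list, no separate lookup table of document identifiers is needed.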

If you have a directory of plain text files, and if you have Python (as well as 
all of its friends) installed, then you can run the program something like this:

  $ ./topic-model.py ./shamanism/text/ 5 5 3

The result will be a list of topics (think “subject terms”) and their most 
associated files:

  * god; church; christian; divine; christ
    o ./shamanism/text/uva.x002756372.txt
    o ./shamanism/text/uc1.$b43226.txt
    o ./shamanism/text/mdp.39015062241685.txt

  * god; gods; primitive; spirits; worship
    o ./shamanism/text/hvd.ah59xi.txt
    o ./shamanism/text/uc2.ark+=13960=t3mw2kc4t.txt
    o ./shamanism/text/mdp.39015025016869.txt

  * social; cultural; culture; group; economic
    o ./shamanism/text/uc1.b4558415.txt
    o ./shamanism/text/uc1.$b604512.txt
    o ./shamanism/text/mdp.39015003464057.txt

  * russian; siberia; river; south; russia
    o ./shamanism/text/uc2.ark+=13960=t1qf8s666.txt
    o ./shamanism/text/mdp.39015068416885.txt
    o ./shamanism/text/njp.32101068979754.txt

  * cf; hebrew; el; text; babylonian
    o ./shamanism/text/nyp.33433081840559.txt
    o ./shamanism/text/wu.89097203632.txt
    o ./shamanism/text/uc2.ark+=13960=t84j0cs6h.txt

What’s really cool is that I can now search my corpus for terms like god, 
church, or Christian, and the ranked files float to the top; this particular 
topic modeling process works for me, and now I can “turn the knobs” to improve 
the results as well as consider plotting the results on a Cartesian plane to 
visualize similarity.
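
A search of this kind might look something like the sketch below. It is an assumption-laden toy, not the program's actual code: the file names, the vocabulary dictionary, the scores, and the search() helper are all hypothetical, and it assumes each column of the matrix corresponds to one vocabulary term, as is the case for TfidfVectorizer output. The scores of the query terms are summed per document and the files are sorted by that total.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the real corpus.
filenames = ["a.txt", "b.txt", "c.txt"]
vocabulary = {"church": 0, "god": 1, "river": 2}  # term -> column number
tfidf = csr_matrix(np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.25, 0.75],
    [0.0, 0.0, 1.0],
]))


def search(terms):
    """Return (file name, summed score) pairs, best match first."""
    columns = [vocabulary[term] for term in terms if term in vocabulary]
    if not columns:
        return []
    # Select the query columns and add them up row by row.
    totals = np.asarray(tfidf[:, columns].sum(axis=1)).ravel()
    hits = [(filenames[i], float(totals[i])) for i in np.flatnonzero(totals)]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


results = search(["god", "church"])
print(results)
```

Summing the columns is only one possible way to combine several query terms; other weightings would rank the files differently.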

Fun with text mining. 

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] accessing a python compressed sparse row format object

2017-09-27 Thread Eric Lease Morgan
On Sep 26, 2017, at 4:41 PM, Thomale, Jason  wrote:

>>>> Does anybody here know how to access a Python compressed sparse row format 
>>>> (CSR) object? [1]
>>>> 
>>>> [1] CSR - http://bit.ly/2fPj42V
>>> 
>>> Do you have a link to the code you're using?
>> 
>> Yes, thank you. See —> 
>> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  —ELM
> 
> I'm not familiar with the APIs in question, but--if I'm looking at this 
> right, your CSR matrix (tfidf) looks like it would have columns corresponding 
> with topics and rows corresponding with documents. If that's the case, you 
> could maybe do something like this:
> 
>   1. Use tfidf.getcol() to get the column corresponding
>      to your chosen topic. Looks like that should give you a
>      1-dimensional matrix of all document scores for that
>      topic.
> 
>   2. Cast that to an array of scores using .toarray(),
>      and then a list with .tolist(). (I think?)
> 
>   3. Use a list comprehension and "enumerate" to generate
>      explicit doc IDs based on each document's position in
>      the list, creating a list of 2-element lists or tuples,
>      (doc_id, score). While you're at it, you could filter
>      the list comprehension to give you only the documents
>      with scores that are greater than 0, or some other
>      threshold.
> 
>   4. Pass the results through the built-in "sorted"
>      function to sort your list of tuples based on score.
> 

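> >>> topic = 9497
> >>> score_thresh = 0
> >>> # ravel() flattens the (n_docs, 1) column to one score per document
> >>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
> >>> docs_and_scores = [(doc_id, score) for doc_id, score in
> ...                    enumerate(topic_scores) if score > score_thresh]
> >>> # reverse=True puts the highest score first
> >>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1],
> ...                             reverse=True)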
> 
> The resulting "most_relevant_docs" variable should be a list of tuples that 
> looks something like this (for example):
> [(102, 0.9), (33, 0.875), (365, 0.874), ...]
> 
> Not sure if that's helpful...? There's probably a more numpy/scipy way of 
> doing the above using actual numpy array methods (especially the 4th line).


Jason, this is REALLY close, and I have begun to include it at the very end of 
my code. Thank you! ‘More later. code4lib++  —Eric Morgan
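
The four steps quoted above can also be written with NumPy array methods, as the last remark suggests. The following is a minimal sketch under stated assumptions: the small tfidf matrix and the rank_documents() helper are made up for illustration, and the real matrix would come from the vectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the thread's tfidf matrix: 4 documents (rows) x 3 columns.
tfidf = csr_matrix(np.array([
    [0.0, 0.9, 0.1],
    [0.2, 0.0, 0.0],
    [0.0, 0.4, 0.3],
    [0.5, 0.7, 0.0],
]))


def rank_documents(matrix, column, threshold=0.0):
    """Return (doc_id, score) pairs for one column, highest score first."""
    # Slice out the column and flatten it to one score per document.
    scores = matrix[:, column].toarray().ravel()
    # Keep only the documents whose score exceeds the threshold.
    doc_ids = np.flatnonzero(scores > threshold)
    # argsort sorts ascending, so reverse it for a descending ranking.
    order = np.argsort(scores[doc_ids])[::-1]
    return [(int(doc_ids[i]), float(scores[doc_ids[i]])) for i in order]


most_relevant_docs = rank_documents(tfidf, column=1)
print(most_relevant_docs)
```

Here the filtering and sorting are done on arrays instead of Python lists, which tends to matter more as the number of documents grows.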


Re: [CODE4LIB] accessing a python compressed sparse row format object

2017-09-26 Thread Eric Lease Morgan
On Sep 26, 2017, at 1:28 PM, Andromeda Yelton wrote:

>> Does anybody here know how to access a Python compressed sparse row format
>> (CSR) object? [1]
>> 
>> [1] CSR - http://bit.ly/2fPj42V
> 
> Do you have a link to the code you're using?


Yes, thank you. See —> 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  —ELM


Re: [CODE4LIB] accessing a python compressed sparse row format object

2017-09-26 Thread Andromeda Yelton
Do you have a link to the code you're using?

On Tue, Sep 26, 2017 at 1:25 PM, Eric Lease Morgan  wrote:

> Does anybody here know how to access a Python compressed sparse row format
> (CSR) object? [1]
>
> I am using Python to do a bit of topic modeling (think “classification”),
> and so far, the results are more than plausible, but the results only
> return topics not documents corresponding to the topics. Along the way, my
> script creates a compressed sparse row format object, and it looks
> something like this:
>
>   (0, 16099)    0.055924002143
>   (0, 9497)     0.0256051292226
>   (0, 16202)    0.140746540109
>   (0, 38982)    0.000842900625312
>   :             :
>   (309, 40805)  0.0435077792741
>   (309, 45679)  0.0435077792741
>   (309, 19462)  0.0435077792741
>   (309, 8346)   0.0435077792741
>   (309, 31204)  0.0435077792741
>
> Where the first column denotes a document identifier, the second column
> denotes a topic identifier, and the third column denotes the score of the
> topic in the document. In the example above, document #0 is a lot about
> topic #16202 but not a lot about topic #38982.
>
> I want to query my CSR object. For example, given a topic identifier (e.g.,
> 48692), return a list of all document identifiers and scores from the
> object. I will then sort the scores to find which documents most
> significantly use the given topic.
>
> I can’t for the life of me figure out how to get what I need. I can get
> individual values like this, where tfidf is my CSR object:
>
>   >>> print( tfidf[ 309, 31204 ] )
>   0.0435077792741
>
> Any help would be greatly appreciated.
>
> [1] CSR - http://bit.ly/2fPj42V
>
> —
> Eric Morgan
>



-- 
Andromeda Yelton
Senior Software Engineer, MIT Libraries: https://libraries.mit.edu/
President, Library & Information Technology Association: http://www.lita.org
http://andromedayelton.com
@ThatAndromeda 


[CODE4LIB] accessing a python compressed sparse row format object

2017-09-26 Thread Eric Lease Morgan
Does anybody here know how to access a Python compressed sparse row format 
(CSR) object? [1]

I am using Python to do a bit of topic modeling (think “classification”), and 
so far, the results are more than plausible, but the results only return topics 
not documents corresponding to the topics. Along the way, my script creates a 
compressed sparse row format object, and it looks something like this:

  (0, 16099)    0.055924002143
  (0, 9497)     0.0256051292226
  (0, 16202)    0.140746540109
  (0, 38982)    0.000842900625312
  :             :
  (309, 40805)  0.0435077792741
  (309, 45679)  0.0435077792741
  (309, 19462)  0.0435077792741
  (309, 8346)   0.0435077792741
  (309, 31204)  0.0435077792741

Where the first column denotes a document identifier, the second column denotes 
a topic identifier, and the third column denotes the score of the topic in the 
document. In the example above, document #0 is a lot about topic #16202 but not 
a lot about topic #38982.
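
To make the layout concrete, here is a small pure-Python sketch of how a compressed sparse row structure stores such (document, column, score) triples. The numbers are made up, and the triples() and column() helpers are hypothetical names used only for illustration; SciPy keeps the same three arrays in the data, indices, and indptr attributes of a CSR matrix.

```python
# A hand-built CSR layout for a 3-document x 5-column matrix (made-up values).
data = [0.06, 0.14, 0.04, 0.03, 0.09]  # the non-zero scores, row by row
indices = [1, 3, 0, 3, 4]              # the column number of each score
indptr = [0, 2, 3, 5]                  # row i owns data[indptr[i]:indptr[i + 1]]


def triples(data, indices, indptr):
    """Yield (row, column, score) for every stored value."""
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            yield row, indices[k], data[k]


def column(data, indices, indptr, wanted):
    """Collect (row, score) pairs for one column by scanning every row."""
    return [
        (row, score)
        for row, col, score in triples(data, indices, indptr)
        if col == wanted
    ]


all_triples = list(triples(data, indices, indptr))
column_3 = column(data, indices, indptr, 3)
print(all_triples)
print(column_3)
```

The sketch also shows why the format is row-oriented: reading one document's scores is a single slice, while collecting one column means visiting every row.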

I want to query my CSR object. For example, given a topic identifier (e.g., 
48692), return a list of all document identifiers and scores from the object. I 
will then sort the scores to find which documents most significantly use the 
given topic.

I can’t for the life of me figure out how to get what I need. I can get 
individual values like this, where tfidf is my CSR object:

  >>> print( tfidf[ 309, 31204 ] )
  0.0435077792741

Any help would be greatly appreciated.

[1] CSR - http://bit.ly/2fPj42V

—
Eric Morgan