On Sep 16, 2019, at 12:20 PM, Athina Livanos-Propst <[email protected]> 
wrote:

> I'm starting to think about a project that would involve key terms from 
> other types of text (transcripts, captions, documents). I'm basically trying 
> to build a tool that I can use to extract key terms from larger strings of 
> text, i.e. pull out the important words from a larger sentence.


Fun!

Keyword extraction comes in many forms, each with its own strengths & 
weaknesses, but for any such task, one's data MUST first be transformed into 
plain text:

  * Frequencies - Counting & tabulating each & every n-gram in a text, sans 
stop words, is a good place to start. It is rather unsophisticated, especially 
for uni-grams (one-word "phrases"), but the frequencies of bi-grams, tri-grams, 
etc. can be quite insightful. If the frequencies of uni-grams are extracted, 
consider identifying the lemma of each word to get a more holistic picture of 
the document in question.
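
  The n-gram counting described above can be sketched in a few lines of 
Python; the tiny stop word list and the sample sentence below are mine, purely 
for illustration (in practice, use something like NLTK's stopwords corpus):

  #!/usr/bin/env python

  # ngrams.py - count n-grams in a text, sans stop words; an illustrative sketch

  from collections import Counter
  import re

  # a toy stop word list; replace with a real one for real work
  STOPWORDS = { 'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it' }

  def ngrams( text, n ) :

      # normalize, tokenize, and drop stop words
      words = [ w for w in re.findall( r'[a-z]+', text.lower() ) if w not in STOPWORDS ]

      # slide a window of size n over the remaining words
      return [ ' '.join( words[ i:i + n ] ) for i in range( len( words ) - n + 1 ) ]

  # count & tabulate the bi-grams of a sample sentence
  text = 'The quick brown fox jumps over the lazy dog; the quick brown fox rests.'
  for gram, count in Counter( ngrams( text, 2 ) ).most_common( 3 ) : 
      print( gram, count )

  Running this tabulates "quick brown" and "brown fox" twice each; with a 
real stop word list and a real document, the most frequent bi-grams begin to 
look like keywords.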

  * TFIDF - Many relevancy ranking algorithms are rooted in TFIDF (term 
frequency / inverse document frequency). TFIDF considers the frequency of a 
word in a document, the size of the document, the number of documents in which 
the word appears, and the total number of documents in the corpus. Calculating 
the TFIDF score for each word in a document, and then setting a significance 
threshold, is a well-understood method of keyword extraction. Here's a pointer 
to a Perl program doing such work --> 
https://github.com/ericleasemorgan/reader/blob/master/bin/classify.pl
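
  For the curious, the TFIDF arithmetic itself fits in a few lines of Python; 
the three-document toy corpus below is mine, purely for illustration:

  #!/usr/bin/env python

  # tfidf.py - score a word against a small corpus with TFIDF; an illustrative sketch

  import math

  # a toy corpus of three "documents"
  corpus = [ 'the cat sat on the mat', 'the dog sat on the log', 'cats and dogs' ]

  def tfidf( word, document, corpus ) :

      # term frequency: how often the word appears, relative to document size
      words = document.split()
      tf = words.count( word ) / len( words )

      # inverse document frequency: words appearing in fewer documents weigh more
      df = sum( 1 for d in corpus if word in d.split() )
      idf = math.log( len( corpus ) / df ) if df else 0.0

      return tf * idf

  # 'cat' appears in only one document, so it scores high there
  print( round( tfidf( 'cat', corpus[ 0 ], corpus ), 3 ) )

  A word appearing in every document gets an IDF of log(1) = 0, and thus a 
TFIDF score of zero, which is exactly why TFIDF discounts ubiquitous words.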

  * TextRank - This algorithm is modeled on Google's PageRank, and it is 
probably what you want to use, especially since there is a handy-dandy Python 
library which implements it --> 
https://radimrehurek.com/gensim/summarization/keywords.html  Here is an 
example Python script:

  #!/usr/bin/env python

  # txt2keywords.py - given a file, output a list of keywords


  # configure; increase or decrease to change the number of desired output words
  RATIO = 0.01

  # require
  from gensim.summarization import keywords
  import sys

  # sanity check
  if len( sys.argv ) != 2 :
      sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <file>\n" )
      quit()

  # initialize
  file = sys.argv[ 1 ]

  # slurp up the given file
  text = open( file, 'r' ).read()

  # process each keyword; can't get much simpler
  for keyword in keywords( text, ratio=RATIO, split=True, lemmatize=True ) : 
      print( keyword )

  # done
  quit()


<plug>By the way, the Distant Reader does all of this sort of work, and more 
--> https://distantreader.org  Feed the reader a set of files, and it will 
compute keywords, extract parts-of-speech & named-entities, summarize your 
documents, etc. Sample outputs are here --> 
http://carrels.distantreader.org</plug>

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
University of Notre Dame
