I've been thinking about how to use named entities as metadata. How might I do
so intelligently?
Given sets of full text, it is possible to extract named entities from a
document. Named entities include things such as names of people, names of
organizations, names of places, etc. These things can be quite informative when
it comes to describing "aboutness".
Here at Notre Dame we have a collection of digitized Catholic pamphlets, and my
named entity extractor (a Python library called spaCy) can loop through each
sentence in a pamphlet and output named entities. Each pamphlet can have many
named entities, and each entity can be repeated many times. The extractor is
not perfect nor is the content. For example, the extractor might call a
particular entity a person when in reality it is an organization or a place.
After all, spaCy works off a model created through a machine learning process,
and consequently spaCy does not always guess things correctly. At the same
time, the pamphlets have been OCRed, so the extracted entities are sometimes
"mis-spelled".
The pamphlets collection includes close to 1,500 documents totaling 97 MB of
plain text. After extracting only the names of persons from the collection, I
have close to .5 million names (rows of data). Again, each name can be listed
more than once for each document.
My question is now two-fold. First, how do I go about normalizing ("cleaning")
my names? I could use OpenRefine to normalize things, but unfortunately
OpenRefine does not seem to scale very well when it comes to .5 million rows of
data. OpenRefine's coolest tools for normalizing are its clustering functions,
and I believe I can rather easily implement a version of the Levenshtein
algorithm (the edit-distance measure behind one of those clustering methods) in
any number of computer languages, including Python or SQL. Using Levenshtein I
can then fix the various mis-spellings.
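A first stab at such a normalization might look like the sketch below. The
names are made up, the distance threshold (2) is an arbitrary starting point,
and comparing every name against every other name will get slow with .5
million rows, so a real implementation would want some sort of blocking or
sorting first:

  # normalize-names.py - a sketch of Levenshtein-based normalization
  from collections import Counter

  def levenshtein(a, b):
      """Compute the classic edit distance between two strings."""
      if len(a) < len(b):
          a, b = b, a
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          current = [i]
          for j, cb in enumerate(b, 1):
              insertions = previous[j] + 1
              deletions = current[j - 1] + 1
              substitutions = previous[j - 1] + (ca != cb)
              current.append(min(insertions, deletions, substitutions))
          previous = current
      return previous[-1]

  # names is the raw list of extracted person names
  names = ['Aquinas', 'Aqunias', 'Aquinas', 'Augustine', 'Augustne']
  counts = Counter(names)

  # map each spelling to the most frequent spelling within distance 2
  canonical = {}
  for name in counts:
      neighbors = [other for other in counts if levenshtein(name, other) <= 2]
      canonical[name] = max(neighbors, key=lambda other: counts[other])

  print(canonical)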
Second, assuming my entities have been normalized, which ones do I actually
include as metadata? I could simply remove the duplicate entities associated
with a given file and then add all of the remaining names. But this results in
a whole lot of names, and just because a name is mentioned one time does not
necessarily justify its inclusion as metadata. I could then say, "If a name is
mentioned more than once, then it is justified for inclusion", but this policy
breaks down when a document is really long; many names get mentioned more than
once, and the document is still not "about" them.
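To make the problem concrete, that naive policy amounts to nothing more than
counting, something like this (the names are made up):

  # a naive selection policy: keep a name only if it is mentioned more than once
  from collections import Counter

  # entities is the list of normalized names extracted from one pamphlet
  entities = ['Newman', 'Newman', 'Chesterton', 'Leo XIII']
  counts = Counter(entities)

  keepers = [name for name, count in counts.items() if count > 1]
  print(keepers)   # ['Newman']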
Instead, I think I need to implement some sort of weighting system. "Given all
the entities extracted from a set of documents, only include those names which
are (statistically) significant." I suppose I could implement some version of
TF/IDF to derive weight and significance. Hmmmm...
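Such a weighting might look like the sketch below. The data is made up, and
the "keep the top five" cut-off is arbitrary; a real implementation would need
to be tuned against the whole collection:

  # weigh-names.py - a sketch of TF/IDF weighting over extracted names
  import math
  from collections import Counter

  # documents maps a pamphlet identifier to its list of normalized names
  documents = {
      'pamphlet-001': ['Newman', 'Newman', 'Newman', 'Chesterton'],
      'pamphlet-002': ['Chesterton', 'Leo XIII', 'Leo XIII'],
      'pamphlet-003': ['Newman', 'Aquinas'],
  }

  # document frequency: in how many pamphlets does each name appear?
  df = Counter()
  for names in documents.values():
      df.update(set(names))

  # score each name in each pamphlet and keep only the heaviest as metadata
  for identifier, names in documents.items():
      counts = Counter(names)
      scores = {}
      for name, count in counts.items():
          tf = count / len(names)
          idf = math.log(len(documents) / df[name])
          scores[name] = tf * idf
      keepers = sorted(scores, key=scores.get, reverse=True)[:5]
      print(identifier, keepers)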
Have you extracted named entities from full text and then included them in your
metadata? If so, then how? What characteristics do the entities have to have to
justify their inclusion as metadata? Inquiring minds would like to know.
--
Eric Morgan