I've been thinking about how to use named entities as metadata. How might I do
so intelligently?
Given sets of full text, it is possible to extract named entities from a
document. Named entities include things such as names of people, names of
organizations, names of places, etc. These things can be quite informative when
it comes to describing "aboutness".
Here at Notre Dame we have a collection of digitized Catholic pamphlets, and my
named entity extractor (a Python library called spaCy) can loop through each
sentence in a pamphlet and output named entities. Each pamphlet can have many
named entities, and each entity can be repeated many times. The extractor is
not perfect nor is the content. For example, the extractor might call a
particular entity a person when in reality it is an organization or a place.
After all, spaCy works off a model created through a machine learning process,
and consequently spaCy does not always guess things correctly. At the same
time, the pamphlets have been OCRed, so the extracted entities are sometimes
"mis-spelled".
The pamphlets collection includes close to 1,500 documents totaling 97 MB of
plain text. After extracting only the names of persons from the collection, I
have close to .5 million names (rows of data). Again, each name can be listed
more than once for each document.
My question is now two-fold. First, how do I go about normalizing ("cleaning")
my names? I could use OpenRefine to normalize things, but unfortunately
OpenRefine does not seem to scale very well when it comes to .5 million rows of
data. OpenRefine's coolest tools for normalizing are its clustering functions,
and I believe I can rather easily implement a version of the Levenshtein
algorithm (the edit-distance measure behind one of those clustering methods) in
any number of computer languages, including Python or SQL. Using Levenshtein I
can then fix the various mis-spellings.
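A first stab at such a normalization might look like the sketch below. The
names are made up, the distance threshold (2) is an arbitrary starting point,
and comparing every name against every other name will get slow with .5
million rows, so a real implementation would want some sort of blocking or
sorting first:

  # normalize-names.py - a sketch of Levenshtein-based normalization
  from collections import Counter

  def levenshtein(a, b):
      """Compute the classic edit distance between two strings."""
      if len(a) < len(b):
          a, b = b, a
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          current = [i]
          for j, cb in enumerate(b, 1):
              insertions = previous[j] + 1
              deletions = current[j - 1] + 1
              substitutions = previous[j - 1] + (ca != cb)
              current.append(min(insertions, deletions, substitutions))
          previous = current
      return previous[-1]

  # names is the raw list of extracted person names
  names = ['Aquinas', 'Aqunias', 'Aquinas', 'Augustine', 'Augustne']
  counts = Counter(names)

  # map each spelling to the most frequent spelling within distance 2
  canonical = {}
  for name in counts:
      neighbors = [other for other in counts if levenshtein(name, other) <= 2]
      canonical[name] = max(neighbors, key=lambda other: counts[other])

  print(canonical)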
Second, assuming my entities have been normalized, which ones do I actually
include as metadata? I could simply remove the duplicate entities associated
with a given file and then add all of the remaining names. But this results in
a whole lot of names, and just because a name is mentioned one time does not
necessarily justify its inclusion as metadata. I could then say, "If a name is
mentioned more than once, then it is justified for inclusion", but this policy
breaks down when a document is really long; many names get mentioned more than
once, and the document is still not "about" them.
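To make the problem concrete, that naive policy amounts to nothing more than
counting, something like this (the names are made up):

  # a naive selection policy: keep a name only if it is mentioned more than once
  from collections import Counter

  # entities is the list of normalized names extracted from one pamphlet
  entities = ['Newman', 'Newman', 'Chesterton', 'Leo XIII']
  counts = Counter(entities)

  keepers = [name for name, count in counts.items() if count > 1]
  print(keepers)   # ['Newman']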
Instead, I think I need to implement some sort of weighting system. "Given all
the entities extracted from a set of documents, only include those names which
are (statistically) significant." I suppose I could implement some version of
TF/IDF to derive weight and significance. Hmmmm...
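Such a weighting might look like the sketch below. The data is made up, and
the "keep the top five" cut-off is arbitrary; a real implementation would need
to be tuned against the whole collection:

  # weigh-names.py - a sketch of TF/IDF weighting over extracted names
  import math
  from collections import Counter

  # documents maps a pamphlet identifier to its list of normalized names
  documents = {
      'pamphlet-001': ['Newman', 'Newman', 'Newman', 'Chesterton'],
      'pamphlet-002': ['Chesterton', 'Leo XIII', 'Leo XIII'],
      'pamphlet-003': ['Newman', 'Aquinas'],
  }

  # document frequency: in how many pamphlets does each name appear?
  df = Counter()
  for names in documents.values():
      df.update(set(names))

  # score each name in each pamphlet and keep only the heaviest as metadata
  for identifier, names in documents.items():
      counts = Counter(names)
      scores = {}
      for name, count in counts.items():
          tf = count / len(names)
          idf = math.log(len(documents) / df[name])
          scores[name] = tf * idf
      keepers = sorted(scores, key=scores.get, reverse=True)[:5]
      print(identifier, keepers)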
Have you extracted named entities from full text and then included them in your
metadata? If so, then how? What characteristics do the entities have to have to
justify their inclusion as metadata? Inquiring minds would like to know.
--
Eric Morgan