As promised: the email reply to Calvin's question, in the form of a blog
post.

-- linas

---------- Forwarded message ---------
From: OpenCog Brainwave <[email protected]>
Date: Sat, Jun 19, 2021 at 8:53 PM
Subject: [New post] Text Attribution with Link Grammar
To: <[email protected]>



New post on *OpenCog Brainwave*: Text Attribution with Link Grammar
<https://blog.opencog.org/2021/06/20/text-attribution-with-link-grammar/>
by Linas Vepstas

A question was recently asked as to whether Link Grammar could be used to
attribute text to a specific author. I had fun writing the reply; let me
reproduce it below. It starts at square one.

Consider a police detective analyzing a threatening note. At some point in
prior centuries, it becomes common knowledge that hand-written notes are
subject to forensic analysis. Criminals switch to typewriters; alas, some
famous spy cases from the 1940s are solved by linking notes to the
typewriters that produced them. By the 1970s, Hollywood shows us films
with the bad guy clipping words from newspapers. Aside from looking for
fingerprints left on the paper, psychological profilers look for
idiosyncrasies in how the criminal expresses ideas: strange wording, odd
phrases, punctuation or lack thereof.

How about computer text? It's well known that many people consistently
mis-spell words (I consistently mis-spell "thier"), and I think there was
some murder-trial evidence that hinged on this. Moving into the PC era,
the 1980s onwards, we get postmodernism and corpus linguistics. One of the
fruits is the "bag of words" model: different texts have different ratios
of words. Although spotted much earlier, computers allow this to be applied
to a zillion and one problems in text classification. Basically, you have a
vector of (word, frequency) pairs, and you can judge the similarity between
vectors with assorted distance measures (the dot-product is very popular
and also just plain wrong, but I digress). I don't think you'd have any
particular problem using this method to attribute a novel to James
Joyce, for example.
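To make the bag-of-words idea concrete, here is a minimal sketch (illustrative code only, not part of any existing toolset; the sample texts are invented), using cosine similarity rather than the raw dot product:

```python
from collections import Counter
import math

def bag_of_words(text):
    """A (word, frequency) vector: just count lower-cased, split words."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

known   = bag_of_words("yes I said yes I will yes")
unknown = bag_of_words("yes he said he will not")
print(round(cosine_similarity(known, unknown), 3))  # → 0.456
```

Real systems weight the counts (e.g. tf-idf) before comparing, but the basic shape is just this: two sparse vectors and a distance measure.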

It becomes subtle, perhaps, if the text is short: say, a letter, and you
are comparing it to other letters written in the same era, written by
eloquent Irishmen. The words that Joyce might use in a letter might not be
the ones he'd use in a novel. It's reasonable to expect that bag-of-words
will fail to provide an unambiguous signal. How about sentence structure,
then? (This is what the original question was.) Yes, I agree: that is a
good way, maybe the best way, of doing this (at current levels of
technology). One might still expect Joyce to construct his sentences in a
way that is particular to his mode of thinking, irrespective of the topic
that he writes on. Mood and feeling echo in the grammar.

So, how might this work? Before I dive into that, a short digression.
Besides bag-of-words, there is also the bag of word-pairs. Here, you collect
not (word, frequency) pairs, but (word-pair, frequency) pairs. One collects
not just nearest-neighbor word-pairs, but word-pairs within some window:
say, of length six. The problem is that there are vast numbers of word-pairs,
like "the-is" and "you-banana": hundreds of millions. Most are junk. You can
weed most of these away by focusing only on those with high mutual
information, but even so, you're left with the problem of overfitting.
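The pair counting and the mutual-information weeding can be sketched as follows (illustrative only; this computes pointwise mutual information over a toy text, with the window of six mentioned above):

```python
from collections import Counter
import math

def windowed_pairs(words, window=6):
    """Yield (left, right) pairs that co-occur within `window` words."""
    for i, left in enumerate(words):
        for right in words[i + 1 : i + window]:
            yield (left, right)

def pair_mi(words, window=6):
    """Pointwise mutual information: log2 p(l,r) / (p(l) p(r))."""
    pairs = Counter(windowed_pairs(words, window))
    singles = Counter(words)
    n_pairs = sum(pairs.values())
    n_words = sum(singles.values())
    return {
        (l, r): math.log2((c / n_pairs)
                          / ((singles[l] / n_words) * (singles[r] / n_words)))
        for (l, r), c in pairs.items()
    }

mi = pair_mi("the cat sat on the mat the cat ran".split())
# Genuine collocations like ("cat", "sat") score higher than
# junk pairs like ("the", "the"); keep only the high-MI pairs.
```

On a real corpus one would also threshold on raw counts, since MI estimates from one or two observations are unreliable, but the weeding principle is the same.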

Enter the n-gram (as in the "Google n-gram viewer") or, better yet, the
skip-gram, which is an n-gram with some "irrelevant" words omitted.
Effectively all neural-net techniques are skip-gram based. To crudely
paraphrase what a neural net does: as you train it on a body of text (say
... James Joyce's complete works...), it develops a collection of (skip-gram,
frequency) pairs, or rather, a (skip-gram, weight) vector. You can then
compare this to some unknown text: the neural net will act as a
discriminator or classifier, telling you if that other text is sufficiently
similar (often using the dot product, which is just plain... but I
digress...). The "magic" of the neural net is that it figures out *which*
skip-grams are relevant, and which are noise/junk. (There are millions of
trillions of skip-grams; out of these, the neural net picks out 200 to 500
of them. This is a non-trivial achievement.)
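Skip-grams in this sense can be enumerated mechanically: slide a window over the sentence and keep every in-order word subsequence of length n. A sketch (illustrative only; a neural net does not enumerate these explicitly, which is exactly why the counts explode):

```python
from itertools import combinations

def skipgrams(words, n=3, window=5):
    """All in-order n-word subsequences drawn from a sliding window:
    n-grams with some intervening words omitted."""
    grams = set()
    for i in range(len(words) - n + 1):
        span = words[i : i + window]
        grams.update(combinations(span, n))  # preserves word order
    return grams

sent = "I am proud to be an emotionalist".split()
grams = skipgrams(sent)
print(("I", "proud", "be") in grams)  # → True
```

Even this seven-word sentence yields dozens of 3-skip-grams; over a whole corpus the counts explode combinatorially, which is the overfitting problem mentioned above.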

How might this work for one of James Joyce's letters? Hmm. See the problem?
If the classifier is trained on his novels, the vocabulary there might be
quite different from the vocabulary in his personal letters, and that
difference in vocabulary will mess up the recognition. Joyce may be using
the same sentence constructions in his letters and novels, but with a
different vocabulary in each. A skip-gram classifier is blind to
word-classes: it's blind to the grammatical constructions. Something as
basic as a synonym trips it up. (Disclaimer: there is some emerging
research into solving these kinds of problems for neural nets, and I am
*not* up on the latest! Anyone who knows better is invited to amplify!)

I've said before (many, many times) that skip-grams are like Link Grammar
disjuncts, and it's time to make this precise. Let's try this:

    +---->WV--->+     +-----IV--->+-----Ost-----+
    +->Wd--+-SX-+--Pa-+--TO--+-Ixt+   +--Dsu*v--+
    |      |    |     |      |    |   |         |
LEFT-WALL I.p am.v proud.a to.r be.v an emotionalist.n

Here, an example skip-gram might be (*I..proud..be*) or
(*proud..be..emotionalist*). A sentence like "*I was immodestly an
emotionalist*" would be enough for a police detective to declare that Joyce
wrote it. Yet there is no skip-gram match.

Consider now the Link Grammar word-disjunct pairs. For the above sentence,
here's the complete list:
               I  == Wd- SX+
              am  == SX- dWV- Pa+
           proud  == Pa- TO+ IV+
              to  == TO- I*t+
              be  == Ix- dIV- O*t+
              an  == Ds**v+
    emotionalist  == D*u- Os-

You can double-check this by carefully looking at the diagram above; notice
that "*proud*" links to the left with Pa and to the right with TO and IV.
The original intent of disjuncts is to indicate grammatical structure: so,
"Pa" is a "predicative adjective", and "IV" links to an "infinitive verb".
As a side-effect, they work with word-classes: for example, "*He was happy
to be an idiot*" has exactly the same parse, even though the words are quite
different.

Finally, to get back to the original question of author attribution: here's
an idea, a "bag of disjuncts". Let's collect (disjunct, frequency) pairs
from Joyce's novels, and compare them to his letters. The motivation for
this idea is that perhaps the specific vocabulary words are different,
but the sentence structures are similar.
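The bag-of-disjuncts idea can be sketched directly from the parse above (the disjunct strings are copied from the listing; the comparison code is illustrative, not from any existing toolset):

```python
from collections import Counter
import math

# Word-disjunct pairs for "I am proud to be an emotionalist",
# taken from the parse listing above.
joyce = [("I", "Wd- SX+"), ("am", "SX- dWV- Pa+"), ("proud", "Pa- TO+ IV+"),
         ("to", "TO- I*t+"), ("be", "Ix- dIV- O*t+"),
         ("an", "Ds**v+"), ("emotionalist", "D*u- Os-")]

# "He was happy to be an idiot" parses identically: the words differ,
# the disjuncts do not.
other = [("He", "Wd- SX+"), ("was", "SX- dWV- Pa+"), ("happy", "Pa- TO+ IV+"),
         ("to", "TO- I*t+"), ("be", "Ix- dIV- O*t+"),
         ("an", "Ds**v+"), ("idiot", "D*u- Os-")]

def bag_of_disjuncts(parse):
    """Drop the words; keep only the (disjunct, frequency) counts."""
    return Counter(d for _, d in parse)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

sim = cosine(bag_of_disjuncts(joyce), bag_of_disjuncts(other))
print(sim)  # ≈ 1.0: identical grammatical structure, different vocabulary
```

This makes both the strength and the weakness visible: structure is matched independently of vocabulary, but, as noted next, that match can be *too* perfect.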

How well does this work? I dunno. No one has ever studied this in any
quantitative, scientific setting. Some failings are obvious: there is a
100% match to "*He was happy to be an idiot*", even though the word-choice
might not be Joycean. There is a poor match to "*I was immodestly an
emotionalist*", even though the word "*emotionalist*" is extremely rare, and
a dead giveaway. There's also a problem with the correspondence
"*immodestly*" <=> "*proud to be*": "*immodestly*" is an adverb, not a
predicative adjective, and it's a single word, not a word-phrase. Raw,
naive Link Grammar is insensitive to synonymy between word-phrases.

There is a two-decade-old paper that explains exactly how to solve the
multi-word synonymous-phrases problem. It's been done. It's doable. I can
certainly point out a half-dozen other tricks and techniques to further
refine this process. So, yes, I think that this all provides a good
foundation for text-attribution experiments. But I mean what I say:
"experiments". I think it could work, and I think it might work quite well.
But to do better, you'd have to actually do it. Try it. It would take a
goodly amount of work before any literary critic would accept your results,
and even more before a judge would accept it as admissible evidence in a
court of law.

As to existing software: I have a large collection of tools for counting
things and pairs of things, and for comparing the similarity of vectors.
Most enthusiasts would find that code unusable until it gets rewritten in
Python. Alas, that is not forthcoming. If you wanted to actually do what I
describe above, some very concrete plans would need to be made.

I also have a daydream about *generating text* in the style of a given
author: given a corpus, create more sentences and paragraphs in the style
and vocabulary of that corpus. My ideas for this follow along similar lines
of thought to the above, but that is... a discussion for some other day.

*Linas Vepstas* | June 20, 2021 | URL:
https://blog.opencog.org/2021/06/20/text-attribution-with-link-grammar/





