Wow! Thanks for the teach-in, Linas. Very interesting, and you make it all as 
clear as it could be, I guess.

I think of your projects as an experiment whose goal is to find how far it's 
possible to get in learning a language using nothing but written records - a 
bit like deciphering a dead language simply by spotting patterns in the 
available corpus of texts. To some extent, your success will reflect the 
quality and 'depth' of the raw data, which (in the case of English texts) 
already reflect quite a sophisticated linguistic analysis thanks to the word 
spaces, the punctuation and the spelling (which distinguishes some homophones). 
I'm not sure what it will tell us about human language, but it will presumably 
tell us a lot about the limits of AI. Would you agree?

Anyway, I'm very impressed by what you and your colleagues in this field have 
achieved already.

Best wishes, Dick

On 18/11/2018 02:19, Linas Vepstas wrote:
Hi Dick,

Well, yes, but, "it depends". What you describe is found, more-or-less. There 
are some (relatively simple) mechanical processes that generate this.  Since 
they are mechanical, they are meant to be taken as non-judgmental, 
non-subjective lab instruments for examining syntax collected from nature, in 
the wild.  Like 17th century telescopes, they are blurry, and allow subjective 
interpretation.  You see something, but it's not always clear what you see. 
Different ones give different views. There is a fairly broad selection, each of 
which gives different details.  The good news is that they agree on the overall 
structure, and that this overall structure agrees with classical symbolic 
linguistics, at a general level; the game is now to get to the next level of 
detail.  The details are 
currently too blurry to say "ah ha, this linguist was exactly right, and that 
one was exactly wrong". It seems likely that everyone was a little-bit right, 
and a little-bit wrong. So it goes.

Let me give a concrete example, the MST example. This is a one-page recap of 
Deniz Yuret's PhD thesis, circa 1998. I hope this is not too off-track.

Here, one starts with some reasonably large corpus, say Wikipedia, or Project 
Gutenberg. (You eventually discover that Wikipedia is very, very deficient in 
action verbs, like run, jump, cry, sing, sail, think.  But that's for much, much 
later. It does, however, affect the statistics very deeply.)

One then counts the co-occurrence of word-pairs. How often is word w seen to 
the left of word v, in a window of size 6 or 8 or so? (Window size mostly 
doesn't matter much).  Call this count N(w,v).  This is a "real" quantity, 
based on "facts"; it's a measurement of "reality". Corpus-dependent, but based 
on language captured in-the-wild.
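
A minimal sketch of the counting step, in Python. The function name 
count_pairs, the whitespace tokenization and the two-sentence toy corpus are 
assumptions made for illustration; they are not taken from Yuret's thesis or 
from any existing pipeline.

```python
from collections import Counter

def count_pairs(sentences, window=6):
    """Count ordered pairs (w, v): w seen to the left of v, at most `window` words apart."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()          # assumption: plain whitespace tokenization
        for i, w in enumerate(words):
            # every word that follows w within the window is counted once
            for v in words[i + 1 : i + 1 + window]:
                counts[(w, v)] += 1
    return counts

# Toy corpus, invented for illustration only.
corpus = ["the dog was black", "the dog ran"]
N = count_pairs(corpus)
```

Here N[("the", "dog")] plays the role of N(w,v) for w = "the" and v = "dog". A 
real corpus would need a proper tokenizer and far more text.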

Next: compute a magical quantity: the "point-wise mutual information", MI or 
PMI. I can explain/motivate why it's correct, or "best", just not here, not 
now. There are other possibilities, too, but the other ones are less coherent, 
they don't quite make sense.   The MI is a simple, explicit formula:

    MI(w,v) = log_2 [ N(w,v) N(*,*) / ( N(w,*) N(*,v) ) ]

where N(w,*) = sum-over-all-v N(w,v), N(*,v) = sum-over-all-w N(w,v), and 
N(*,*) = sum-total of all word-pairs that were counted.  There is a very long 
history rooted in mathematics and physics and information theory that explains 
what MI is, and why it is a "good thing", suitable for this task. (That is, MI 
has nothing to do with language: it works for chemistry, too, and astronomy, 
etc. It's generic.)
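
To make the formula concrete, here is a small Python sketch that computes MI 
from a table of pair counts. The name pair_mi and the toy counts are 
assumptions for illustration; pairs that were never observed are simply left 
out, since their MI would be minus infinity.

```python
import math
from collections import Counter

def pair_mi(counts):
    """Return {(w, v): MI(w, v)} for every observed pair in `counts`."""
    left = Counter()    # N(w,*): sum over all v
    right = Counter()   # N(*,v): sum over all w
    total = 0           # N(*,*): sum-total of all counted pairs
    for (w, v), n in counts.items():
        left[w] += n
        right[v] += n
        total += n
    return {
        (w, v): math.log2(n * total / (left[w] * right[v]))
        for (w, v), n in counts.items()
        if n > 0
    }

# Toy counts, invented for illustration only.
toy_counts = {("Northern", "Ireland"): 2, ("the", "dog"): 3, ("the", "cat"): 3}
toy_mi = pair_mi(toy_counts)
```

In this toy table "Northern" is only ever followed by "Ireland", so that pair 
gets a higher MI than ("the", "dog"), where "the" is shared between two 
different right-hand words.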

For linguistics, MI is nice because ... when two words co-occur more often than 
chance would predict, it has a large value, and when they don't, it has a small 
(or negative) value. Typical range for MI is from minus 20 to plus 40 or so 
(depending on corpus size).  Examples:

     MI(Northern, Ireland) = +25
     MI(the, and) = -10

Yuret's Ansatz: we can, we should use MI to tell us which links in a dependency 
parse are the correct links.  The highest-MI links are correct, in some certain 
objective sense, and the lowest-MI ones are garbage, nonsense.

The algorithm: MST, "Maximum Spanning Tree".  Take a sentence. Draw an edge that 
connects every possible word to every other, i.e. a clique, a big tangle, and 
then remove the links with the lowest MI, one by one, until a tree is left.  
(Alternatively, start with no edges at all, and add the highest-MI edge, then 
the second highest, etc. until you have a tree, and no unconnected words).  
Then declare 
this to be the "correct parse", brush the dust off your overalls, and call it a 
day.  Here's what happens when you do this, and some critiques, and how to do 
better:

-- Yuret does this, and finds 85% accuracy or thereabouts, vs. a hand-annotated 
corpus. (Which I think needs to be acknowledged as a huge success! Viz: 
linguists are not hallucinating; the structure is "actually there", in "true 
reality".)
-- Prepositions cause problems for MST.
-- During the search for the tree, you can (arbitrarily) choose to reject 
crossing links. Or not.
-- During the search for the tree, you can arbitrarily choose to connect all 
words (this might not make sense for interjections, coughs, sneezes, non-verbal 
hand-motions, etc.)
-- During the search for the tree, you can explicitly exclude loops (but 
perhaps loops are desirable, so...)
-- The above did not describe a link from "root" to head-word. (there's a way 
to fix this).
-- The links are unlabeled: the algo does not tell you if they are subj, obj, 
etc.

The last criticism is perhaps the deepest, most significant.  I claim I know 
exactly how to get past it. Also, I claim I know how to get past the 85% 
accuracy.  I will not explain in this email, though.
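
For concreteness, the "add the highest-MI edge first" variant can be sketched 
in Python as below. This is only one of the choices listed above: it connects 
all words, excludes loops, does not reject crossing links, and adds no root 
link. The function name mst_parse and all the MI values are made up for 
illustration; they are not measured from any corpus.

```python
def mst_parse(words, mi, default=float("-inf")):
    """Greedy (Kruskal-style) maximum spanning tree over word positions."""
    n = len(words)
    # Candidate links between positions i < j, scored by MI(left word, right word).
    edges = sorted(
        ((mi.get((words[i], words[j]), default), i, j)
         for i in range(n) for j in range(i + 1, n)),
        key=lambda e: e[0],
        reverse=True,
    )
    parent = list(range(n))   # union-find forest, used to detect loops

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # this link joins two separate pieces: no loop
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == n - 1:    # all words connected
            break
    return sorted(tree)

# Invented MI values, for illustration only.
words = ["the", "dog", "was", "black"]
mi = {("the", "dog"): 5.0, ("was", "black"): 4.0, ("dog", "was"): 3.0,
      ("dog", "black"): 2.0, ("the", "was"): -1.0, ("the", "black"): -2.0}
tree = mst_parse(words, mi)
```

With these invented numbers the links come out as the-dog, dog-was and 
was-black. Raising MI(dog, black) above MI(dog, was) would swap in a dog-black 
link instead, which is exactly the corpus-dependence discussed here.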

The moral of the story:
-- One can objectively measure the existence of dependencies.
-- One has a lot of alternatives to explore (tree or loops allowed? cross or 
no-cross allowed? Use MI or use something else? (others have explored 
"something else", were less successful, but more famous. Standard story of fame 
and prestige in academia))
-- The MST or MST-like approaches are a way-point, not the final end-point. A 
step on the path.

Oh, I should mention: some of the neural-net stuff, like word2vec, GloVe, can 
be kind-of understood as sort-of MST-like things, if you look at them the 
right way. There's a lot to be said, but it does offer a bridge between the 
"here" of symbolic linguistics, and the "there" of the deep-learning crowd, a 
unification of the two.

So, my ruminations about "shallow" and "deep" are more along these lines: Let's 
accept what MST does (or some variant of it, according to taste and evidence), 
and call this "shallow", so that "shallow" is a way-marker on the map, from 
here to there.  So, shallow is giving us some kind of dependency parse, 
mostly-ish accurate, with deficiencies, but it's "unarguable" because it is 
based on measured statistics. Variations of the algorithm give somewhat 
different results, but they are all in the same ballpark.

So what's the "deep structure"? Well, it's the structure we want to actually 
have. Say, your life's work. Or perhaps Mel'čuk's MTT. Or maybe 
predicate-argument structure. Or Sowa's concept nets. Or some mashup of these. 
I don't particularly care: all I know is that it's the general direction for 
the next way-point on the journey.

How do we get there? Well, there has to be some relatively simple collection of 
formulas and algorithms that are mechanical in their action. The quality of 
these mechanisms will be judged on how closely they line up with the more 
sophisticated theories of syntax+semantics.  My laboratory bench has a bunch of 
these mechanisms lying about. I cannot assemble them and evaluate them fast 
enough. I am totally certain that they will work: preliminary evidence is very 
good, and besides, most or all of them are already based on tricks and 
techniques that many others have described, and have found to be useful and 
successful.

To get back to your example: it's not so simple, because it includes 
morphology, which I did not talk about, above. How can one find out that 
"rain", "rains", "rained" and "raining" are somehow the same word, sharing a 
stem, but with different suffixes? Well, there is a way to do this, but it's 
another, different mechanism to be bolted on.   How can one discover that "it 
was raining" and "it rained" are vaguely synonymous? They don't even have the 
same word-count. Well, that is yet another mechanism, that goes elsewhere, 
attaching a different way. There's no particular graph to rule-them-all.  
There's a morpho-graph that draws an edge between "rain" and "ing".  There's a 
semantic graph that treats "wasraining" as a single unit.  There's a third 
graph that attaches "it" to its referent. Except, for this example, "it" is a 
pleonastic "it", referring to an implicit, non-specified imaginary place-time, 
rather than to some explicit word in a previous sentence. The three graphs are 
related, but have different functions; they illustrate different relationships.

-- Linas



On Sat, Nov 17, 2018 at 5:09 AM Hudson, Richard wrote:

Hello Linas. If you leave it to the learning mechanism, aren't you inevitably 
going to get crossed links? To take an even simpler example, "It was raining", 
your learning mechanism should work out three predictions:

  *   that "was" needs a subject (i.e. a preceding noun or pronoun).
  *   that any form of the verb RAIN needs the pronoun "it" as its subject (as 
in "It rained").
  *   that "was" needs (or at least accepts) an ing-form verb after it.

When you put these expectations together, you find a dependency triangle, with 
subject links from both verbs to "it" and a dependency from "was" to "raining". 
Since both of the "it" links are the same ('subject'), there's no reason for 
assigning them to different levels of structure (deep vs surface), so you get a 
topological tangle.

Dick

On 16/11/2018 22:05, Linas Vepstas wrote:
I hit "send" too soon, without finishing the thought:

On Fri, Nov 16, 2018 at 3:02 PM Linas Vepstas wrote:
For example, this parse makes sense, and seems right:

    +-------->WV------->+
    +---->Wd-----+      |
    |      +Ds**c+-Ss*s-+---Pa--+
    |      |     |      |       |
LEFT-WALL the  dog.n was.v-d black.a

but there is another possibility, that kind-of makes sense (and perhaps 
language learning will find):

    +---->Wd---->+
    |            +-->adjcomp--->+
    |      +Ds**c+      +<-cop<-+
    |      |     |      |       |
LEFT-WALL the  dog.n   was    black

Here, adjcomp is "adjectival complement" and "cop" is copula.  Some dependency 
grammars draw this graph. Some call it "predicative adjectival modifier". Let's 
not quibble. Note that I did not draw an arrow from subject to verb. I could, I 
suppose.  Note that it is now IMPOSSIBLE to draw an arrow from root/left-wall 
to the verb, because it would require a link-crossing: it would have to cross 
over the adjcomp arrow.

Thus, if you want to draw an arrow from root to head-verb, and also get a 
planar graph, you are not allowed to draw the adjcomp/predadj arrow.  That 
helps explain what LG does.

It also helps make clear that the no-links-crossing constraint is imperfect. It 
seems reasonable, but clearly, there is a violation in the above rather
trivial sentence!
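
The crossing can be checked mechanically. Below is a small Python sketch, 
assuming links are written as pairs of word positions (LEFT-WALL = 0, the = 1, 
dog = 2, was = 3, black = 4); the helper names links_cross and crossing_pairs 
are invented for this illustration and are not part of LG.

```python
def links_cross(a, b):
    """True if two links (pairs of word positions) cross when drawn above the sentence."""
    (i, j), (k, l) = sorted(a), sorted(b)
    # Two arcs cross when exactly one end of one lies strictly inside the other.
    return i < k < j < l or k < i < l < j

def crossing_pairs(links):
    """List every pair of links that cross each other."""
    return [(a, b)
            for n, a in enumerate(links)
            for b in links[n + 1:]
            if links_cross(a, b)]

# First parse: WV, Wd, Ds, Ss, Pa.
lg_parse = [(0, 3), (0, 2), (1, 2), (2, 3), (3, 4)]
# Second parse: Wd, Ds, adjcomp (dog -> black), cop (black -> was) ...
dep_parse = [(0, 2), (1, 2), (2, 4), (4, 3)]
# ... plus a root-to-verb arrow, which is the link that causes trouble.
with_root = dep_parse + [(0, 3)]

lg_crossings = crossing_pairs(lg_parse)
dep_crossings = crossing_pairs(with_root)
```

The first parse has no crossings, while in the second the wall-to-verb link 
(0, 3) crosses the adjcomp link (2, 4).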

OK, to finish this thought. Let us speculate what an MST parse of this sentence 
might be like. It depends on the MI values for the word-pairs MI(dog,was), 
MI(was,black) and MI(dog,black).  I don't know what these are, but clearly they 
will be different for a corpus of kids-lit than for a corpus of math texts.

Next question: what happens when words are sorted into categories?  What is 
MI(dog, some color)? What is MI(some animal, some color)? What is MI(physical 
object, some color)?

I don't have a good story here, except to say that copulas and predicative 
adjectives present maybe the simplest-possible example of a difficulty of moving 
from surface syntax (SSynt, what LG does) to deep syntax (DSynt, what MTT 
does). Yet, this move is a critical one.

I'm currently thinking of it as a graph-rewrite rule, that converts the SSynt 
graph into a PLN graph:

EvaluationLink
     PredicateNode "has color"
     ListLink
         Concept "dog"
         Concept "black"

Or, perhaps as Nil might like to write:

LambdaLink
     VariableList
          Variable $PHY
          Variable $COL
     AndLink
          EvaluationLink
               PredicateNode "has color"
               ListLink
                    Variable $PHY
                    Variable $COL
          InheritanceLink
               Variable $PHY
               Concept "physical object"
          InheritanceLink
               Variable $COL
               Concept "color"

Of course, even the above representation is wrong, in several ways, but 
nit-picking it at this stage is counter-productive.

The question is: given a learned grammar, with statistics, how do we get to the 
DSynt or the OpenCog variant?  Well, the now-quite-old Dekang Lin DIRT paper, 
and the newer-but-still-old Poon & Domingos unsupervised learning paper show the 
way.

Onward ho!

Linas
--
cassette tapes - analog TV - film cameras - you
--
You received this message because you are subscribed to the Google Groups 
"link-grammar" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/link-grammar.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/link-grammar/CAHrUA36aRbObkgMmOGvxO2eGr0RV6pcwrkVBUR-yua_LOYNFSg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
Richard Hudson (dickhudson.com)


