Ben,

Just got out of surgery for my broken leg; this email attempts to prove
that the general anesthesia didn't kill too many brain cells.  It's a report
on some of the language-learning results.

Let's dive in.
(mst-parse-text "this is a test")

Raw hard-to-read result below.  Easier-to-read versions later.
ctv holds the raw count of how many times the word was observed.

((2.862118287646645 ((2 (WordNode "is" (ctv 1 0 8165736))) (3 (WordNode "a" (ctv 1 0 14691104)))))
 (2.1880378875282904 ((1 (WordNode "this" (ctv 1 0 1300681))) (2 (WordNode "is" (ctv 1 0 8165736)))))
 (2.8103625339100944 ((1 (WordNode "this" (ctv 1 0 1300681))) (4 (WordNode "test" (ctv 1 0 60328))))))

The floating-point number above and below is the Yuret MI of the word pair.
I've amended
https://github.com/opencog/opencog/tree/master/opencog/nlp/learn/learn-lang-diary/learn-lang-diary.pdf
pages 2-5 so that it's less confusing and the formulas are accurate.
Basically, it derives Yuret's formulas in a more rigorous
way; if I recall correctly, his argument was scattered, and just asserted the result
without deriving it.  So the PDF derives it.
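In case a worked example helps: the core quantity is just the pointwise MI of a word pair. Here's a hypothetical Python sketch -- not the actual Scheme code in the learn repo, and the function name and count bookkeeping are made up -- showing the formula and the -1000 sentinel used for never-observed pairs:

```python
from math import log2

def pair_mi(n_ab, n_a, n_b, n_pairs, n_words):
    """Pointwise mutual information of an ordered word pair, in bits:
       MI(a, b) = log2( P(a, b) / (P(a) * P(b)) ).
    Positive MI means the pair co-occurs more often than chance."""
    if n_ab == 0:
        return -1000.0               # sentinel: the pair was never observed
    p_ab = n_ab / n_pairs            # joint probability of the pair
    p_a = n_a / n_words              # marginal probability of the left word
    p_b = n_b / n_words              # marginal probability of the right word
    return log2(p_ab / (p_a * p_b))

# Independent words give MI = 0; a 4x enrichment gives MI = 2 bits.
# (Counts chosen as powers of two so the arithmetic is exact.)
print(pair_mi(2, 16, 16, 128, 128))   # 0.0
print(pair_mi(8, 16, 16, 128, 128))   # 2.0
```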

((2.862118287646645 ((2 (WordNode "is" )) (3 (WordNode "a" ))))
 (2.1880378875282904 ((1 (WordNode "this" )) (2 (WordNode "is" ))))
 (2.8103625339100944 ((1 (WordNode "this" )) (4 (WordNode "test" )))))

Simplifying further:

(((2 (WordNode "is" )) (3 (WordNode "a" )))
 ((1 (WordNode "this" )) (2 (WordNode "is" )))
 ((1 (WordNode "this" )) (4 (WordNode "test" ))))

The integer is the ordinal of the word.  Note that the linkage "is-a" was
selected over "a test" -- that's because "a test" has an MI of 2.0935.
This is not terribly surprising; any MI of less than four is pretty crappy,
and these four words occur so commonly that the correlation between them
really is quite weak -- they're almost drowning in noise.  Extracting
disjuncts should strongly sharpen the results.  Next email.
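For what it's worth, the link-selection itself can be sketched in a few lines of Python. This is only an illustration of the idea -- a greedy Kruskal-style spanning-tree build, not the actual mst-parse-text implementation -- and the function names and 0-based indexing are my own:

```python
def mst_parse(words, mi):
    """Toy maximum-spanning-tree parse: greedily accept the highest-MI
    word-pair links, rejecting any link that would form a cycle or
    cross an already-accepted link.  `mi` maps (i, j) word-index
    pairs (i < j) to the MI of that pair."""
    edges = sorted(mi.items(), key=lambda kv: kv[1], reverse=True)
    parent = list(range(len(words)))     # union-find, for cycle detection

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    def crosses(i, j, k, l):             # two links cross when their
        return i < k < j < l or k < i < l < j   # endpoints interleave

    links = []
    for (i, j), score in edges:
        if find(i) == find(j):
            continue                     # would create a cycle
        if any(crosses(i, j, k, l) for k, l in links):
            continue                     # link-crossing constraint
        parent[find(i)] = find(j)
        links.append((i, j))
    return links

# MI values for "this is a test", taken from the output above (0-based):
mi = {(0, 1): 2.188, (1, 2): 2.862, (0, 3): 2.810, (2, 3): 2.0935}
print(mst_parse(["this", "is", "a", "test"], mi))
# [(1, 2), (0, 3), (0, 1)]
```

It picks the same three links as the output above -- "is a", "this test", "this is" -- and "a test" loses out.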

Here's a better example:

(mst-parse-text "cats eat cat food")

((7.329 ((2 (WordNode "eat" (ctv 20938))) (4 (WordNode "food" (ctv 73924)))))
 (4.992 ((3 (WordNode "cat" (ctv 18408))) (4 (WordNode "food" (ctv 73924)))))
 (-1000 ((1 (WordNode "cats" (ctv 5902))) (4 (WordNode "food" (ctv 73924))))))

So "eat food" has a decent MI, as expected.  Also "cat food" is decent. The
minus-1000 means that the word pair "cats food" was never observed.
(get-pair-mi-str "cats" "eat") = -1000 means that "cats eat" was never
observed!  Bummer! The word "cats" was observed almost 6000 times, and this was
not enough to discover a sentence that has "cats eat" in it.  These
statistics are from some relatively smallish sample of WP articles, so lack
of such a sentence is maybe not surprising. Here, children's & young-adult
lit may be better.

Anyway, clustering that reveals that cats, dogs, etc. are similar should
help with this, or so goes the hypothesis.

The word-pair "cats cat" does occur and has an MI of 5 but is prevented
from linking by the link-crossing constraint.  I have not yet tried to figure
out whether the Dick Hudson landmark-transitivity idea can be mutated to apply
to this situation. I suppose I should think about things before writing
about them, but not thinking is faster.
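The constraint itself is easy to state: with word ordinals numbered left to right, two links cross exactly when their endpoints interleave. A hypothetical one-line helper, applied to the sentence above:

```python
def crosses(i, j, k, l):
    """Links (i, j) and (k, l), with i < j and k < l, cross exactly
    when their endpoints interleave."""
    return i < k < j < l or k < i < l < j

# "cats eat cat food" -> ordinals 1..4.  The candidate link cats--cat
# (1, 3) crosses the already-accepted link eat--food (2, 4):
print(crosses(1, 3, 2, 4))   # True -- so the parser must reject it
print(crosses(2, 4, 3, 4))   # False -- sharing an endpoint is not a crossing
```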

Let's try again:

(mst-parse-text "dogs eat dog food")
((7.329 ((2 (WordNode "eat" (ctv 20938))) (4 (WordNode "food" (ctv 73924)))))
 (7.047 ((3 (WordNode "dog" (ctv 41896))) (4 (WordNode "food" (ctv 73924)))))
 (5.050 ((1 (WordNode "dogs" (ctv 14852))) (2 (WordNode "eat" (ctv 20938))))))

Well, that's much better. Let's try something harder:

(mst-parse-text "It is not uncommon to discover strange things")

((7.515 ( (WordNode "not" ) (WordNode "uncommon" )))
 (4.142 ( (WordNode "is" ) (WordNode "uncommon" )))
 (4.412 ( (WordNode "It" ) (WordNode "is" )))
 (2.739 ( (WordNode "uncommon" ) (WordNode "to" )))
 (3.529 ( (WordNode "to" ) (WordNode "discover" )))
 (0.822 ( (WordNode "to" ) (WordNode "things" )))
 (6.171 ( (WordNode "strange" ) (WordNode "things" ))))

Almost right -- the stinker in there is "to things" and it has a terrible
MI.  The correct link would have been "discover things" but this word-pair
was never ever observed.

That's it for now, more later.

p.s. The above is obtained with code that uses values in full
generality; so, for example, the normalized word frequency is stored as

(Valuation
    (WordNode "foo")
    (PredicateNode "*-Frequency Key-*")
    (FloatValue 0.1234567 3.018))

Note that a Valuation is like an EvaluationLink, except that it associates
a Value (here, a FloatValue) with an Atom under a key; Values, unlike
Atoms, are not themselves stored in the AtomSpace.
The first number is the normalized frequency of observation, N(foo) / N(all
words), and the second number is the log-base-2 of the first (it's easier
to read than counting zeros in a frequency).
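A quick sanity check on those two numbers (the "foo" frequency above is of course just an example value):

```python
from math import log2

freq = 0.1234567     # the first number in the FloatValue above
print(log2(freq))    # roughly -3.018, matching the second number
```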

I had to fix a dozen bugs in brand-new SQL backend code to get this to
work right. It all seems stable, now.

--linas
