Re: [opencog-dev] Re: 100 sentences for GC

'Nil Geisweiller' via opencog Sun, 31 Mar 2019 23:51:22 -0700

Hi,

On 4/1/19 5:17 AM, Linas Vepstas wrote:

    But somehow, I suspect... Isn't this why OpenCog has "unified rule
engine" (URE) instead of link grammar at its core,
No. It has the rule-engine because back then, I did not understand
sheaves. I'm starting to think that the rule engine is a strategic
mistake. The original idea is that rule-application is the main
conceptual abstraction of term-rewriting. One rewrites, or proves
theorems by applying sequences of rules. It turns out that discovering
the right sequence is hard. Finding correct long sequences is hard - a
combinatorial explosion.


So is writing programs, yet MOSES still manages to do that and
produce useful models. BTW, writing inference trees (which the URE
does) is almost exactly equivalent to writing programs.

The openpsi system addresses some of these issues. Unfortunately, its
current implementation is a tangle of rule-selection mechanisms, and
theories of human psychology. It's probably better than the URE, but is
currently not as powerful.


OpenPsi and the URE are 2 different systems, doing different
things. OpenPsi is an action selection mechanism to fulfill urges and
create plans, the URE is an inference tree builder. Though actually
both may need each other. The inference control mechanism in the URE
uses a specialized re-implementation of OpenPsi, and OpenPsi could use
the URE to build plans.

I'm trying to place a theory of sheaves as a replacement for URE, and as
the natural generalization of openpsi, but I've successfully
self-sabotaged myself in these efforts.


Linas, what is a good starting point to understand what you're trying
to accomplish?

Here? https://github.com/opencog/atomspace/blob/master/opencog/sheaf/README.md

Nil


    and with URE things get much more complicated. I'm sorry, but that
    is still a Gordian knot to me, considering all of my modest knowledge.


We all have modest knowledge. That is the nature of the human condition.

    On the other hand, if someone really smart would provide automatic
    grammar extraction by means of unrestricted grammar
    <https://en.wikipedia.org/wiki/Unrestricted_grammar>, I believe that
    would be it.

Yes, that is the goal of the language-learning project. However, as
noted in my last email (on the link-grammar list) it is not enough to
just learn a semi-Thue system, declare victory, and go home. The
example I gave there:


   "I think that you should give that car a second look"
   "you should really give that song a second listen"
   "maybe you should give Sue a second chance".

Learning to parse these "set phrases" or phrasemes is equivalent to
learning a semi-Thue system; however, it's not enough to realize that all
three are forms of advice-giving, having "conserved" or "fixed" regions
"x YOU SHOULD y GIVE z SECOND w" where z is very highly variable, having
millions of variations, and w only has a few dozen allowed variations.
Note that the words "fixed", "conserved", "variable" are words used in
genetics and proteomics and antibody structure. It's the same idea.

The goal of learning lexical functions (LF's) is to learn that all three
are advice-giving forms, and also to learn what is, and what can be
plugged in for x,y,z,w. So, although a super-whiz-bang grammar learner
capable of learning context-sensitive languages should be able to learn
"x YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning*
of this phrase. To know the *meaning*, you have to know the acceptable
ranges (as fuzzy-sets) of x,y,z,w.
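
As a rough illustration of the "conserved" and "variable" regions, the
template "x YOU SHOULD y GIVE z SECOND w" can be written as a regular
expression and applied to the three example sentences. This is only a
sketch: the function name `slots` is made up for illustration, the
template is hand-written rather than learned, and it says nothing about
the fuzzy-set ranges of x,y,z,w, which is the hard part.

```python
import re

# Hand-written version of the template "x YOU SHOULD y GIVE z SECOND w".
# The fixed (conserved) regions are literals; x, y, z, w are the variable slots.
TEMPLATE = re.compile(
    r"^(?P<x>.*?)\byou should\b(?P<y>.*?)\bgive\b(?P<z>.*?)\bsecond\b(?P<w>.*)$",
    re.IGNORECASE,
)

def slots(sentence):
    """Return the fillers for x, y, z, w, or None if the template does not fit."""
    match = TEMPLATE.match(sentence)
    if match is None:
        return None
    return {name: text.strip() for name, text in match.groupdict().items()}

if __name__ == "__main__":
    examples = [
        "I think that you should give that car a second look",
        "you should really give that song a second listen",
        "maybe you should give Sue a second chance",
    ]
    for sentence in examples:
        print(slots(sentence))
```

Collecting the fillers over a large corpus would show the asymmetry
described above: many distinct values for z, only a few for w. Knowing
which fillers are acceptable still has to be learned separately.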

To conclude, thinking about Turing-completeness is a waste of time,
because Turing completeness only tells you that "x YOU SHOULD y GIVE z
SECOND w" is recursively enumerable; it does not tell you what it
actually means.

Put another way: having a universal Turing machine is not the same as
knowing how some particular program works. Automagically learning a
context-sensitive grammar is not enough to know what that grammar is
"saying/doing".


-- Linas


    Thank you,
    Ivan V.


    On Thu, 28 Mar 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]
    <mailto:[email protected]>> wrote:

        Ben, Linas,

         >But we know that MST parsing is shit.  Stop wasting time on
        MST or trying to "improve" it.

        I think that sounds like kind of support for the concept of
        "dumb explosive parsing" being advocated for 1+ year ago:

        
https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy

        I also agree with Linas's other reasoning in this thread. I would
        consider giving it a try starting next month if we don't have a
        breakthrough with DNN-MI-milking-based-MST-Parsing by that time.

         > can be done generically, and not just on language

        I think everyone in bio-informatics dreams of extracting secrets
        of "dark side of the genome" with something like that ;-)

        Cheers,

        -Anton


        28.03.2019 1:24, Linas Vepstas wrote:

        Hi Anton,

        I've cc'ed the link-grammar mailing list, because I describe
        below some concepts for word-sense disambiguation. I'm also
        cc'ing the opencog mailing list and ivan vodisek, because
        after studying hilbert systems, I think he's ready to think
        about how knowledge extraction can be done generically, and
        not just on language.

        -- Linas

        On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail
        <[email protected] <mailto:[email protected]>> wrote:

            Hi Linas,

            >I'd call it "interesting", but maybe not "golden"

            These are randomly selected sentences from "Gutenberg
            Children" corpus:

            
http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/

            "Gutenberg Children silver standard" is LG-English parses:

            
http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull

            "Gutenberg Children gold standard" is subset of "silver
            standard" with semi-random selection of sentences skipping
            direct speech and doing manual verification of the links.

            So as long as we are training on "Gutenberg Children"
            corpus, having the test on the same "Gutenberg Children"
            seems reasonable, right?


        Yes. You still need to verify that each word in the "golden"
        corpus occurs at least N=10 or 20 times in the training
        corpus. The dependency of accuracy on N is not generally
        known, but it is very clear that if a word occurs only N=3
        times in the training corpus, then whatever is learned about
        it will be very low quality.
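
        The check described above can be sketched in a few lines.
        The function name, the whitespace tokenization, and the
        default threshold are assumptions made for illustration; the
        real corpora are already tokenized and lower-cased, so
        something this simple may be enough.

```python
from collections import Counter

def undertrained_words(training_sentences, test_sentences, min_count=10):
    """Map each test-corpus word seen fewer than min_count times in training to its count."""
    # Assumption: sentences are already tokenized, so splitting on whitespace is adequate.
    counts = Counter(w for s in training_sentences for w in s.lower().split())
    test_vocab = {w for s in test_sentences for w in s.lower().split()}
    return {w: counts[w] for w in sorted(test_vocab) if counts[w] < min_count}

if __name__ == "__main__":
    # Tiny made-up corpora, only to show the shape of the output.
    training = ["the horse ran"] * 10 + ["the old beast was whinnying"] * 3
    test = ["the old beast was whinnying on his shoulder"]
    for word, n in undertrained_words(training, test, min_count=10).items():
        print(f"{word}: seen {n} times in training")
```

        Sentences containing any word reported by this check are
        candidates for removal from the test corpus, or at least for
        separate scoring.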

            But thanks, we may have put more effort in removal of
            ancient constructions and words even if these are present
            in the corpus.

        If you consistently train on 19th century literature, and then
        evaluate 19th-century literature comprehension, that's fine.
        Just don't expect it to work for 21st century blog posts.


        The strongest effect will be the N=number of observations effect.

            >Anyway -- you only indicate pair-wise word-links. Is the
            omission of disjuncts intentional?

            If you have all links in the sentence, you can construct
            all of the disjuncts with no ambiguity, correct?

        No, but only because you did not indicate the link-type.  The
        whole point of a clustering step is to obtain a link-type; if
        you discard it, you will never get  better-than-MST results.
        The link-type is critical for obtaining the word-classes.  The
        whole point of learning is to learn the word-classes; you've
        learned very little, if you know only word-pairs.

        Consider this example:

        I saw wood
        I saw some wood

        A solution that would be "almost perfect" (or "golden") would
        be this:

        saw: {performer-of-actions}- & {sculptable-mass}+;
        saw: {observer}-  & {viewable-thing}+;

        These disambiguate the two different senses of the word
        "saw".  It's impossible to have word-sense disambiguation
        without actually having these disjuncts.  The word-pairs alone
        are not sufficient to report the link-type connecting the
        words.  Clustering gives the other dictionary entries:

        I: {performer-of-actions}+ or {observer}+;
        wood: {sculptable-mass}- or ({quantity-determiner}- &
        {viewable-thing}-);
        some: {quantity-determiner}+;

        Thus, the pronoun "I" also belongs to two different word-sense
        categories: performers and observers.  Compare to:

        "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer
        of actions" but cannot be an "observer".
        "The dog saw some wood" -- dogs can be observers. They can
        perform some actions; like run, jump, but they cannot saw,
        hammer, cut, stab.
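
        To make the point about link-types concrete, here is a small
        sketch that rebuilds disjuncts from the links of a parse. The
        function name and the (left, right, type) tuple format are
        assumptions for illustration, not the format of any existing
        tool; connectors are written in the same curly-brace notation
        as the entries above, nearest partner first.

```python
def disjuncts_from_links(words, links):
    """Build one disjunct per word from typed links.

    links is a list of (left_index, right_index, link_type) tuples
    with left_index < right_index.
    """
    result = []
    for i, word in enumerate(words):
        # Connectors pointing left ('-'), nearest partner first.
        left = sorted((i - l, t) for l, r, t in links if r == i)
        # Connectors pointing right ('+'), nearest partner first.
        right = sorted((r - i, t) for l, r, t in links if l == i)
        connectors = [f"{{{t}}}-" for _, t in left] + [f"{{{t}}}+" for _, t in right]
        result.append((word, " & ".join(connectors)))
    return result

if __name__ == "__main__":
    typed = disjuncts_from_links(
        ["I", "saw", "some", "wood"],
        [(0, 1, "observer"), (1, 3, "viewable-thing"), (2, 3, "quantity-determiner")],
    )
    for word, disjunct in typed:
        print(f"{word}: {disjunct};")

    # With the link-type erased, every sense of "saw" collapses to the same disjunct.
    untyped = disjuncts_from_links(["I", "saw", "wood"], [(0, 1, "ANY"), (1, 2, "ANY")])
    print(dict(untyped)["saw"])
```

        With typed links the two sentences give two different
        disjuncts for "saw". With untyped links both give the same
        one, so the word-pairs alone cannot separate the senses.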

        The link-type is absolutely crucial to understanding a word.
        The language-learning project is all about learning the
        link-types. Without correct link-type assignments, you cannot
        have correct parses.

        ... which is 100% of the problem with MST. The problem with
        MST is not so much that "it's not accurate" - sure, it is not
        terribly accurate. But even if MST or some MST-replacement was
        100% accurate, it would still be "wrong" because it fails to
        indicate the link-type.  If you want to understand a sentence,
        you MUST know the link-types!

        Otherwise, you just have "green ideas sleep furiously", which
        parses, but only because the link types have been erased, or
        made stupid. Here's a stupid grammar:

        ideas:  {adjective}- & {verb}+;
        green: {adjective}+;

        which allows "green ideas" to parse.  But of course, this is
        wrong; it should have been:

        ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
        green: {physical-object-modifier}+;

        and now it is clear that "green ideas" cannot parse, because
        the link-types clash.
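
        The clash can be shown with a toy matcher. This is a heavy
        simplification, assuming that two words link only when a "+"
        connector on the left word has exactly the same name as a "-"
        connector on the right word; real Link Grammar matching also
        involves subscripts, connector order and planarity. The
        function names and the two dictionaries are illustrative.

```python
def parse_entry(entry):
    """Split an entry such as 'A- & B+' into (left_connectors, right_connectors)."""
    left, right = [], []
    for connector in entry.split("&"):
        connector = connector.strip()
        name, direction = connector[:-1], connector[-1]
        (left if direction == "-" else right).append(name)
    return left, right

def can_link(left_word_entry, right_word_entry):
    """True if a '+' connector of the left word matches a '-' connector of the right word."""
    _, left_plus = parse_entry(left_word_entry)
    right_minus, _ = parse_entry(right_word_entry)
    return any(name in right_minus for name in left_plus)

# The two grammars from the message, copied as plain strings.
STUPID = {
    "green": "{adjective}+",
    "ideas": "{adjective}- & {verb}+",
}
BETTER = {
    "green": "{physical-object-modifier}+",
    "ideas": "{noospheric-modifier}- & {concept-manipulating-verb}+",
}

if __name__ == "__main__":
    print(can_link(STUPID["green"], STUPID["ideas"]))  # coarse types: the link is allowed
    print(can_link(BETTER["green"], BETTER["ideas"]))  # fine types: the link is rejected
```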

        * If you cluster down to 5 or 6 clusters (adjective, verb,
        noun ...) you will get very low quality grammars.

        * If you cluster to 200 or 300 clusters, you get sort-of-OK
        grammars. This is what deep-learning/neural-nets do: this is
        why the deep-learning systems seem to give nice results: 200
        or 300 features is enough to start having adequate functional
        distinctions (e.g. the famous "king - male+female=queen"
        example, or "paris-france+germany=berlin" example)

        * If you cluster to 3K to 8K clusters, you start having a
        quite decent model of language

        * Note that wordnet has 117K "synsets".

        Note that in the above example:
        wood: {sculptable-mass}- or ({quantity-determiner}- &
        {viewable-thing}-);

        the things in the curly-braces are effectively "synsets".

        The next set of goal-posts is to have disjuncts, of maybe
        low-medium quality, and use these to extract ontologies.  e.g.
        {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}

        You can try to do this by clustering but there are probably
        better ways of discovering ontology.
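
        Whatever method discovers it, the resulting ontology is just a
        chain of is-a relations that can be walked. A minimal sketch,
        using only the chain given above; the table layout and the
        function names are assumptions for illustration.

```python
# The is-a chain from the message, stored as child -> parent.
IS_A = {
    "sculptable-mass": "mass",
    "mass": "physical-thing",
    "physical-thing": "thing",
}

def ancestors(category, is_a=IS_A):
    """Return every category reachable by following is-a links upward, nearest first."""
    chain = []
    while category in is_a:
        category = is_a[category]
        chain.append(category)
    return chain

def is_kind_of(category, ancestor, is_a=IS_A):
    """True if category equals ancestor or inherits from it."""
    return category == ancestor or ancestor in ancestors(category, is_a)

if __name__ == "__main__":
    print(ancestors("sculptable-mass"))
    print(is_kind_of("sculptable-mass", "physical-thing"))
```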

            >Also -- no hint of any word-classes or part-of-speech
            tagging? This is surely important to evaluate as well, or
            is this to be done in some other way?  i.e. to evaluate if
            "Pivi" was correctly clustered with other given names?  Or
            that lama/llama was clustered with other four-legged animals?

            We don't have that in MST-Parsing, right? We need this
            corpus to assess the quality of the MST-Parsing so we
            don't need part-of-speech information for that.

        But we know that MST parsing is shit.  Stop wasting time on
        MST or trying to "improve" it. We already know that it is
        close to a high-entropy path to structure; trying to squeeze a
        few more percent of entropy is not worth the effort, not at
        this time.  Focus on finding a high-entropy structure
        extraction algorithm, don't waste time on MST.

        You should be focusing on extracting disjuncts, word-classes,
        word-senses, and trying to improve the quality of those.  If
        you obtain a high-entropy path to these structures, the
        quality of your parses will automatically improve.  Focus on
        the entropy numbers. Try to maximize that.

            The clustering is able to do that anyway - see the graphs
            in the end of the last year report:

            
https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou

            >Also -- I can't tell -- is it free of loops, or are loops
            allowed?  Allowing loops tends to provide stronger, more
            accurate parses.  Loops act as constraints.

            The loops and crossing links are not allowed in the
            MST-Parser now. If we allow them in the test corpus, how
            could it make assessment of MST-Parses better?

            Note that we ARE working with MST-Parses now, according
            to Ben's directions.


        Not to say bad things about Ben, but I'm certain he has not
        actually thought about this problem very much. He is very very
        busy doing other things; he is not thinking about this stuff.
        I have repeatedly tried to explain the issues to him, and it's
        quite clear that he is far away from understanding them, from
        working at the level that I would like to have you and your
        team work at.

        I'm trying to have you make small, quantified baby-steps, to
        verify the accuracy of your methods and data.  What I'm seeing
        is that you are attempting to make giant-steps, without
        verification, and then getting low-quality results, without
        understanding the root causes for them.  You can't dig
        yourself out of a ditch, and digging harder and more furiously
        won't raise the accuracy of the parse results.

        --linas

            We have your MST-Parser-less idea on the map but we are
            NOT trying it now:

            https://github.com/singnet/language-learning/issues/170

            We may try it after we explore the account for costs

            https://github.com/singnet/language-learning/issues/183

            Thanks,

            -Anton

            24.03.2019 9:24, Linas Vepstas wrote:

            Also, BTW, link-grammar cannot parse "I just stood there,
            my hand on the knob, trembling like a leaf." correctly.
            It is one of a class of sentences it does not know about.
            Which is maybe OK, because ideally, the learned grammar
            will be able to do this. But today, LG cannot.

            --linas

            On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas
            <[email protected] <mailto:[email protected]>>
            wrote:

                Anton,

                It's certainly an unusual corpus, and it might give
                you rather low scores. I'd call it "interesting", but
                maybe not "golden". Although I suppose it depends on
                your training corpus.  Here are some problems that
                pop out:

                First sentence --
                "the old beast was whinnying on his shoulder" -- the
                word "whinnying" is a fairly rare English verb -- you
                could read half-a-million wikipedia articles, and not
                see it once. You could read lots of 19th-century or
                early-20th century cowboy/adventure novels, (like
                what you'd find on Project Gutenberg) and maybe see
                it some fair amount. Even then -- to "whinny on a
                shoulder" seems bizarre.. I guess he's hugging the
                horse? How often does that happen, in any cowboy
                novel? "to whinny on something" is an extremely rare
                construction.  It will work only if you've correctly
                categorized "whinny" as a verb that can take a
                preposition.  Are your clustering algos that good,
                yet, to correctly cluster rare words into appropriate
                verb categories?

                Second sentence .. "Jims" is a very uncommon name.
                Frankly, I've never heard of it as a name before.
                Your training data is going to be extremely slim on
                this. And lack of training data means poor
                statistics, which means low scores.  Unless -- again,
                your clustering code is good enough to place "Jims"
                in a "proper name" cluster...

                "the lama snuffed blandly" -- "snuffed" is a very
                uncommon, almost archaic verb. These days, everyone
                spells llama with two l's, not one. Unless you're
                talking about Buddhist monks, it's a typo.

                "you understand?"  is .. awkward. Common in speech,
                uncommon in writing. Unlikely that you'll have enough
                training data for this.

                "Willard" is an uncommon name. Does your training
                corpus have a sufficient number of mentions of
                Willard? Do you have clustering working well enough
                to stick "Willard" into a cluster with other names?

                "it is so with Sammy Jay" is clearly archaic English.

                "he hasn't any relations here" is clearly archaic, an
                olde-fashioned construction.

                "Pivi said not one word" - again, a clearly
                old-fashioned construction. Does the training set
                contain enough examples of "Pivi" to recognize it as
                a name? Are names clustering correctly?

                Any sentence with an inversion is going to sound
                old-fashioned. All of the sentences in that corpus
                sound old-fashioned. Which maybe is OK if you are
                training on 19th century Gutenberg texts .. but it's
                certainly not modern English.  Even when I was a
                child, and I read those old crumbly-yellow paper
                adventure books, part of the fun was that no one
                actually talked that way -- not at school, not at
                home, not on TV. It was clearly from a different time
                and place -- an adventure.

                Anyway -- you only indicate pair-wise word-links. Is
                the omission of disjuncts intentional? Also -- no
                hint of any word-classes or part-of-speech tagging?
                This is surely important to evaluate as well, or is
                this to be done in some other way?  i.e. to evaluate
                if "Pivi" was correctly clustered with other given
                names?  Or that lama/llama was clustered with other
                four-legged animals?

                Also -- I can't tell -- is it free of loops, or are
                loops allowed?  Allowing loops tends to provide
                stronger, more accurate parses.  Loops act as
                constraints.

                -- Linas

                On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @
                Gmail <[email protected]
                <mailto:[email protected]>> wrote:

                    Hi Linas, Andes and whoever understands LG and
                    English well enough both.

                    Attached are first 100 sentences for GC "gold
                    standard" - manually checked based on LG parses.

                    We are expecting more to come in the next two weeks.

                    To enable that, please have cursory review of the
                    corpus and let us know if there are corrections
                    still needed so your corrections will be used as
                    a reference to fix the rest and keep going further.

                    Thank you,

                    -Anton

                    --
                    You received this message because you are
                    subscribed to the Google Groups "lang-learn" group.
                    To unsubscribe from this group and stop receiving
                    emails from it, send an email to
                    [email protected]
                    <mailto:[email protected]>.
                    To post to this group, send email to
                    [email protected]
                    <mailto:[email protected]>.
                    To view this discussion on the web visit
                    
https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
                    
<https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com?utm_medium=email&utm_source=footer>.
                    For more options, visit
                    https://groups.google.com/d/optout.

--cassette tapes - analog TV - film cameras - you

---Anton Kolonin

            skype: akolonin
            cell: +79139250058
            [email protected]  <mailto:[email protected]>
            https://aigents.com
            https://www.youtube.com/aigents
            https://www.facebook.com/aigents
            https://medium.com/@aigents
            https://steemit.com/@aigents
            https://golos.blog/@aigents
            https://vk.com/aigents






--
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/8361a7bc-4c67-0edc-16d4-2d789d98855f%40gmail.com.
For more options, visit https://groups.google.com/d/optout.
