On Wed, Jul 1, 2020 at 11:42 AM Amirouche Boubekki <
[email protected]> wrote:

> On Tue, Jun 30, 2020 at 20:04, Linas Vepstas <[email protected]>
> wrote:
> >
> > Hi Amirouche,
> >
> > Such statistical systems might currently do as well as or better than
> grammar,
>
> They will always fail on new grammar rules, and re-training the
> algorithm over those is painful, I guess. Maybe new grammar rules are
> not a use-case, since nowadays there is a grammar-police-robot in every
> browser...
>

Well, I was trying to be expansive. Some people want to build a
grammar-police-robot, and there are various ways of doing that.

For OpenCog, a goal is to understand speech, and if you look at transcripts
of Zoom videos, you quickly appreciate that no one speaks in a
"grammatical" fashion ... Written and spoken English are almost two
different languages.


> > since, it turns out, writing a complete grammar is impossible; there is
> always yet one more, extremely rare exception to every rule.  Roughly
> speaking, grammar rules have a Zipfian distribution.
>
> I mean probably yes with the approach taken by link-grammar where only
> a couple of experts can edit the grammar. My goal is to make it also
> easy for the user to improve the grammar,


I believe that this is a fundamentally impossible task. The proof would be:
if it were easy, some linguist would have done it by now. I mean, linguists
have been working on grammar since before there were computers, and what
have we got?

> hence the single source of
> truth database


There is no single source of truth for language. Every human being speaks
and understands a language differently. You are not aware of it, because
there is enough overlap that you can communicate with others.  But if you
take samples of Irish-American or German-American from the 1920's, or even
Black-American or Hispanic-American from modern times, you will discover
that their grammar is drastically different from New York Times
English.  And there are more gradations: the rules that work for
Irish-American fail on British-Irish.  The geographical separation, and the
separation of 2-5 decades in time, mean that these two dialects have
diverged.  They are quite different. There is no single source of truth.


> that will include the dictionary of words, grammar
> relations between words, and test cases for each relation.


There is not even one unique parse for any given sentence.  Poetry is
filled with ambiguous sentences. But if you want a textbook example: "I saw
the man with the telescope" has two parses that are both equally
plausible.  This is surprisingly common. Mild alterations of wording, or
simple substitutions of "synonymous" words, can dramatically change both
the parse and the meaning.
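The telescope sentence makes a nice self-contained test. Here is a minimal
sketch (a toy CFG of my own, nothing to do with the actual LG dictionary)
that counts parses with a CKY chart; the classic PP-attachment ambiguity
yields exactly two:

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form -- illustrative only.
lexical = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}
binary = [
    ("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "Det", "N"), ("NP", "NP", "PP"), ("PP", "P", "NP"),
]

def count_parses(words, start="S"):
    """CKY chart that counts distinct parse trees per nonterminal and span."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for nt in lexical.get(w, []):
            chart[(i, i + 1)][nt] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for parent, left, right in binary:
                    l = chart[(i, j)].get(left, 0)
                    r = chart[(j, k)].get(right, 0)
                    if l and r:
                        chart[(i, k)][parent] += l * r
    return chart[(0, n)].get(start, 0)

print(count_parses("I saw the man with the telescope".split()))  # -> 2
```

One parse attaches "with the telescope" to the verb (I used the telescope
to see him), the other to the noun (the man was carrying it).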

> based on a subset
> of English grammar that is not ambiguous


There is no such subset. It does not exist.


> > "A sat solver and an okvs" -- we've played with SAT. Basically, the SAT
> solvers are slower. They used to be sometimes-faster, but Amir's work fixed
> that.
>
> How much slower ? 2 or 3 times slower ?


Now? Maybe 1.1x to 5x slower. It depends on the sentence, and on the
dictionary (say, the Russian vs. the English dict). The algorithms are
different; the performance profile is different.  If you are starting from
scratch, then the SAT solver is "good enough". In LG, there is no incentive
to pursue it; the SAT solver is not as flexible, and it has trouble with
parse ranking.

I mean, there is nothing wrong with minisat as a SAT solver. It's fine.
What's wrong is the idea that SAT is suitable for parsing. It's not.

Basically, if you assign each grammar rule a probability (or a probability
and a confidence), then you would like to rank parses according to
probability. The SAT algos don't know how to avoid exploring
low-probability regions.  It's a classic combinatorial-explosion problem:
you have to perform hill-climbing, or simulated annealing, or whatever, as
you parse, and trim the search space as much as possible.   SAT algos
don't do that -- they explore the entire search space, and return a yes/no
answer. You don't want a yes/no answer, you want a likelihood. SAT solvers
choke on any problem that has a combinatoric explosion.
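To make the trimming concrete, here is a toy sketch (the rule names and
probabilities are made up for illustration): a beam search that extends
partial parses step by step, keeping only the most probable few, instead
of exhaustively exploring everything the way a SAT solver would:

```python
import math
from heapq import nlargest

def beam_extend(beams, options, beam_width=3):
    """Extend each partial parse with each next-step option, then keep
    only the most probable few -- the trimming SAT solvers never do."""
    extended = [(logp + math.log(p), parse + [rule])
                for logp, parse in beams
                for rule, p in options]
    return nlargest(beam_width, extended)

# Hypothetical per-step rule choices, each with a probability.
steps = [
    [("rule_a", 0.7), ("rule_b", 0.3)],
    [("rule_c", 0.6), ("rule_d", 0.4)],
    [("rule_e", 0.9), ("rule_f", 0.1)],
]
beams = [(0.0, [])]  # (log-probability, partial parse)
for options in steps:
    beams = beam_extend(beams, options)
best_logp, best_parse = max(beams)
print(best_parse)  # -> ['rule_a', 'rule_c', 'rule_e']
```

The point is that you get a ranked likelihood out, not a yes/no answer,
and the low-probability corners of the space are never visited.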

There is an interesting middle ground, obtained by stealing the key SAT
idea. The key idea behind SAT is: prune away all trees, all DAGs, and then
exhaustively explore the remaining multiply-connected,
combinatorially-explosive core. SAT is "fast" because "most" (!??)
problems, after pruning, have tiny cores. So, likewise, in the "ideal
parser", one would prune away all branches, and then explore the
combinatorially-explosive core with your favorite combinatorial explorer:
hill-climbing, simulated annealing, bayesian nets, neural nets, genetic
algos, or whatever. I've been trying to get Ben and others interested in
this idea for umpteen years, but no luck yet.  I'm even writing grant
proposals. But no one understands what I say 😢  I've been told that it's
either "trivially obvious" or it's "gibberish"  ... I can't find the middle
ground...
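For the pruning half of that idea, a minimal sketch (my own illustration,
not the LG or minisat code): repeatedly strip vertices with at most one
neighbor, so every tree-like appendage disappears and only the
multiply-connected core is left for the expensive search:

```python
from collections import defaultdict

def two_core(edges):
    """Iteratively strip degree-<=1 vertices (the tree-like fringe);
    what remains is the multiply connected core that needs real search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) <= 1:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return {v: ns for v, ns in adj.items()}

# A triangle with a pendant chain: the chain is pruned, the cycle survives.
g = two_core([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])
print(sorted(g))  # -> [1, 2, 3]
```

On a pure tree, everything gets pruned and there is nothing left to
search; the hard work is confined to whatever cycles survive.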


>
> Thanks for the enlightening conversation. I was imagining such a thing
> but did not have time to look into aspell intricates.
>

There is nothing at all intricate about aspell. You give it a misspelled
word, and it offers up alternative suggestions. It's based on ispell, which
is maybe 40 years old now, or more? Pre-internet software.
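The core idea fits in a few lines (this is a toy sketch; real aspell also
uses phonetic matching and affix compression): score dictionary words by
Levenshtein edit distance and return the near misses:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Offer dictionary words within a small edit distance, nearest first."""
    scored = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_dist]

words = ["grammar", "grandma", "gamma", "hammer"]
print(suggest("gramar", words))  # -> ['grammar']
```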


> It seems to me learning grammar in
> an unsupervised way, is not ready?
>

It's not ready. I encountered two very different problems.

In some of the learned grammars, a large number of alternatives are
produced, and the existing link-grammar parser is too slow. There are 5- or
10-word sentences that take a day or two to parse, because the lexis
contains 10K alternatives per word.  If you have the patience to wait that
long, the resulting parse looks pretty good... This is the combinatorial
explosion I was talking about. Of course, I don't have to generate those
kinds of dictionaries -- there are others that parse very quickly, but with
lower accuracy.

The other, more important problem is the lack of a reference standard.  The
solution to that is obvious: treat parsing as one half of a transducer
pair, with a generator as the other half.  Think of, for example, a
microphone and a loudspeaker. Ideally, you push a sine wave through a
loudspeaker, and get a sine wave back on the microphone. You cannot measure
the quality of the microphone ... or the loudspeaker ... without this
pairing.

The solution is to generate random grammars, consisting of N words, and M
grammatical rules, and a Zipfian or other probability distribution for the
use of words/rules. Then generate a corpus of K "random" sentences. Now
apply the learning tool ... can you learn all N words, all M rules
correctly? How does the accuracy depend on the size K of the corpus?  How
fast can one learn, as a function of N, M, and the word-distribution?

I've started to build that generator, but I am not done yet.
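A sketch of what such a generator might look like (all names and parameters
here are my own illustration, not the actual tool): make up N words and M
binary rules, weight rule choice with a Zipfian distribution, and expand a
start symbol into random sentences:

```python
import random

def zipf_weights(m, s=1.0):
    """Zipfian weights: the k-th rule gets weight 1/k^s."""
    return [1.0 / (k ** s) for k in range(1, m + 1)]

def random_grammar(n_words, n_rules, n_nonterminals=5, seed=0):
    """A random binary CFG over made-up words, with Zipf-weighted rules."""
    rng = random.Random(seed)
    words = [f"w{i}" for i in range(n_words)]
    nts = [f"NT{i}" for i in range(n_nonterminals)]
    rules = [(rng.choice(nts),
              (rng.choice(nts + words), rng.choice(nts + words)))
             for _ in range(n_rules)]
    return words, nts, rules, zipf_weights(n_rules)

def generate(symbol, nts, rules, weights, rng, depth=0, max_depth=8):
    """Expand a symbol into a random sentence; cut off runaway recursion."""
    if symbol not in nts:
        return [symbol]          # terminal word
    if depth >= max_depth:
        return []                # abandon this branch
    cand = [(r, w) for r, w in zip(rules, weights) if r[0] == symbol]
    if not cand:
        return []
    _, rhs = rng.choices([r for r, _ in cand],
                         weights=[w for _, w in cand], k=1)[0]
    out = []
    for sym in rhs:
        out += generate(sym, nts, rules, weights, rng, depth + 1, max_depth)
    return out

# Generate a corpus of K "random" sentences; feed these to the learner and
# ask how many of the N words and M rules it recovers.
words, nts, rules, weights = random_grammar(n_words=50, n_rules=40, seed=1)
rng = random.Random(2)
corpus = [s for s in (generate("NT0", nts, rules, weights, rng)
                      for _ in range(100)) if s]
print(all(w in words for s in corpus for w in s))  # -> True
```

Since the true grammar is known by construction, accuracy as a function of
N, M, and K becomes directly measurable.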

Trying to use human-annotated corpora gives garbage. It's a beginner's mistake.

--linas

-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA37t2DrdJs_WPQBhGzRs31S2m3HrXmnOW0s2AnZC3gUTjw%40mail.gmail.com.
