On Wed, Feb 12, 2020 at 5:18 AM Adrian Borucki <[email protected]> wrote:
>
>
> On Tuesday, 11 February 2020 18:45:44 UTC+1, linas wrote:
>>
>> Salut Amirouche,
>>
>> What you describe was/is the goal of the language-learning project. It is
>> stalled, because there is no easy way to evaluate if it is making forward
>> progress, and is learning a good grammar, or learning junk.
>>
>> The proposed solution to this is to create "random" grammars, and thus
>> compare what the system learned to the precise, exactly-known grammar. The
>> only problem here is that generating a corpus of sentences drawn from a
>> given grammar is surprisingly hard. (i.e. is not an afternoon project, or
>> even a one-week project).
>>
>
> Is using English language as the target too limited (overfitting)?
>
No, not at all.
> There are datasets for grammatical tasks like Part Of Speech tagging; the
> quality of the grammar could be judged by testing performance on those
> tasks.
>
That's what we thought too, initially, and it turns out that doesn't work.
Sometimes the problems are minor: the datasets contain errors, which is
annoying. More difficult are the issues surrounding small grammars and small
corpora: e.g. "child-directed speech" (CDS). One might naively think that
CDS is just adult-English grammar with a smaller vocabulary, but there are
hints that, mathematically, that is just not true. It becomes particularly
obvious when the training corpus is so small that you can run the
calculations by pencil and paper: the probabilities really come out quite
different.
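To make the small-corpus point concrete, here is a toy, hand-checkable
illustration (the three sentences are invented, not from any real CDS
dataset): maximum-likelihood bigram probabilities on a tiny corpus are
dominated by a handful of counts, so they needn't resemble the
adult-English distribution at all.

```python
from collections import Counter

# Invented three-sentence "child-directed" corpus, small enough to tally
# by hand.
corpus = [
    "see the dog",
    "see the ball",
    "the dog runs",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)                 # count individual words
    bigrams.update(zip(words, words[1:]))  # count adjacent word pairs

# Maximum-likelihood estimate of P(dog | the): two of the three
# occurrences of "the" are followed by "dog", so the estimate is 2/3 --
# nothing like what a large adult corpus would give.
p_dog_given_the = bigrams[("the", "dog")] / unigrams["the"]
```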
One soon trips over other problems, too. Some are at the front end:
capitalization, punctuation, anything to do with tokenization (imagine
irregular French verbs!). There are also problems near the back end:
multiple meanings, synonyms, etc. One sees many interesting things that are
both correct and wrong at the same time: e.g. it turns out that "houses"
and "cities" are quite similar: both can be entered, both provide shelter,
both can be a home... and yet a house is not a city. I saw many unexpected
but interesting groupings like this... reminiscent of corpus-linguistics
results.
Oh, and there are corpora problems: Wikipedia lacks "action verbs" (run,
jump, kick, punch), since Wikipedia describes things (X is a Y which Z when
W). It also has vast numbers of foreign words, product names, geographical
names, and obscure terminology: words which appear only once or twice, ever.
Project Gutenberg is mostly 19th- and early-20th-century English, which is
very unlike modern English. We can read it and understand it, but there are
many quite strange and unusual sentences in there which no one would ever
say today. Jane Austen's Pride & Prejudice is a good example.
It became clear that existing corpora and datasets are almost useless for
evaluating quality. I want to be able to control for the size of the
vocabulary, the size of the grammar, the density and distribution of
different parts of speech, and, most importantly, the distribution of
different meanings ("I saw": past tense of "to see" vs. present tense of
"to saw", i.e. "to cut") ... and then evaluate the algos as each of these
parameters is changed.
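As a sketch of what "controlling the knobs" might look like, here is a toy
random-grammar generator. All the names and the flat rule format are
invented for illustration (real dictionaries are Link-Grammar-style, not
flat class sequences), but the knobs are explicit: number of word classes,
words per class, rule shape. Since the generating grammar is known exactly,
every sampled token's true class is recoverable.

```python
import random

def make_random_grammar(n_classes=3, words_per_class=4, seed=42):
    """Build artificial word classes and crude sentence 'rules'."""
    rng = random.Random(seed)
    classes = {f"C{c}": [f"w{c}_{i}" for i in range(words_per_class)]
               for c in range(n_classes)}
    # Each "rule" is just a fixed sequence of word classes -- a crude
    # stand-in for real grammar rules, kept flat for the sake of the sketch.
    rules = [rng.sample(list(classes), k=min(3, n_classes)) for _ in range(5)]
    return classes, rules

def sample_sentence(classes, rules, seed=None):
    """Pick a rule, then pick a random word from each class in it."""
    rng = random.Random(seed)
    rule = rng.choice(rules)
    return " ".join(rng.choice(classes[c]) for c in rule)

classes, rules = make_random_grammar()
corpus = [sample_sentence(classes, rules, seed=i) for i in range(10)]
```

The token naming scheme (`w2_1` belongs to class `C2`) is exactly what
makes the evaluation "precise": the gold class of every word in the corpus
is known by construction.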
The idea is that the learning system is like a transducer:
grammar -> corpus -> grammar, so it's like a microphone, or a high-fidelity
stereo system: we want to evaluate how well it works. Of course, you can
"listen to it" and "see if it sounds good", but really, I'd rather measure
and be precise, and for that we really need inputs that can be controlled,
so that we can see how the system responds as we "turn the knob".
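One possible way to "measure" the transducer, sketched under the assumption
that both the reference grammar and the learned grammar can be reduced to
partitions of the vocabulary into word classes: score the word pairs that
each grammar puts in the same class. Pairwise precision/recall/F1 is just
one illustrative metric, not anything the project has settled on, and the
example grammars below are invented.

```python
from itertools import combinations

def same_class_pairs(classes):
    """All unordered word pairs sharing a class in this grammar."""
    pairs = set()
    for words in classes.values():
        pairs.update(combinations(sorted(words), 2))
    return pairs

def pairwise_f1(reference, learned):
    """Harmonic mean of pairwise precision and recall."""
    ref, hyp = same_class_pairs(reference), same_class_pairs(learned)
    if not ref or not hyp:
        return 0.0
    tp = len(ref & hyp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented example: the learned grammar lumps a verb in with the nouns,
# so the score lands strictly between 0 and 1.
reference = {"noun": ["house", "city"], "verb": ["run", "jump"]}
learned = {"A": ["house", "city", "run"], "B": ["jump"]}
score = pairwise_f1(reference, learned)
```

A perfect learner scores 1.0 against its own generating grammar; turning a
knob (vocabulary size, class count, ambiguity) and re-measuring is the
"precise" version of listening to the stereo.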
-- Linas
>
>> I would love to work on this, but well, good old capitalistic
>> considerations are currently blocking my efforts.
>>
>> --linas
>>
>> On Tue, Feb 11, 2020 at 9:16 AM Amirouche Boubekki <[email protected]>
>> wrote:
>>
>>> I am wondering whether there is existing material about how to
>>> bootstrap a LG-like dictionary using a seed of natural language
>>> elements: grammar, words, punctuation....
>>>
>>> The idea is to use such a seed to teach the program more about the
>>> full grammar using almost natural conversations.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "opencog" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/opencog/CAL7_Mo-Lb_c_1i8uhMWCnXkeOtBPkJtqqwosBp0V2398TUhcfA%40mail.gmail.com
>>> .
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
--
cassette tapes - analog TV - film cameras - you