[agi] Concept Naming Algorithm Re: Auxlangs: International Auxilary Languages

Logan Streondj Sat, 23 Jan 2016 15:38:18 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi AGI list,


As you may know, I'm working on SPEL (Speakable Programming for Every
Language). It has an intermediary language that the VM uses; it is
easy to parse but also speakable. Recently I realized that my
phonotactics can be improved a little, so I decided to redefine the
whole vocabulary base. Doing it manually last time took me nearly half
a year for a mere 1000 words, so I'd like to automate the process this
time.

I tried asking for input from the Auxlang list, but it seems it may be
too technical for them, so I was hoping that the wise people of the
AGI list would be able to offer some input from their vast repertoire
of technical knowledge.

I would like us all to put our heads together and come up with a
root-word-generating, concept-naming algorithm that a supermajority of
us can agree on. "naming and caching are the two biggest problems in
programming". An average human can hold about 4 things in short-term
memory (cache), so root words of up to 4 glyphs are optimal. Naming is
a big problem in programming: different APIs choose different words
and different word orders. Hopefully with SPEL, naming can become a
standardized, algorithmic process. Note that while the words discussed
are in the intermediary language, all human languages will benefit, as
they can be translated to and from it.

There are a few aspects to generating root words for concept names:
* Phonology
* Phonotactics
* Source Languages
* Word to concept matching

As the AGI crowd, you can probably skip ahead to Word to Concept
Matching, but I've included the other sections for completeness.

*Phonology*

I'm fairly sure just about everyone agrees on the phonology I use, as no
one has complained much about it. Here is my definition of the 24-glyph
alphabet, with the glyphs ordered the same way as in PHOIBLE.
Note y = /j/, c = /ʃ/, j = /ʒ/, a = /ä/; otherwise they are all IPA
equivalents.
var Glyph24Alphabet =
["m","k","i","a","y","u","p","w",
 "n","s","t","l","h","f",".","c",
 "e","o","r","b","g","d","z","j"];

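As a quick sanity check on the alphabet above, here is a minimal sketch that confirms the list holds 24 distinct glyphs and spells out the four glyphs that differ from IPA. The names `NON_IPA_GLYPHS` and `glyphToIpa` are made up for illustration and are not part of SPEL.

```javascript
// The 24-glyph alphabet from the message, in the same order.
const Glyph24Alphabet = [
  "m", "k", "i", "a", "y", "u", "p", "w",
  "n", "s", "t", "l", "h", "f", ".", "c",
  "e", "o", "r", "b", "g", "d", "z", "j",
];

// Only the glyphs whose value differs from the IPA symbol of the same shape,
// as given in the note above.
const NON_IPA_GLYPHS = new Map([["y", "j"], ["c", "ʃ"], ["j", "ʒ"], ["a", "ä"]]);

// Hypothetical helper: map a glyph to its IPA value.
// The message does not say what "." stands for, so it passes through unchanged.
function glyphToIpa(glyph) {
  if (!Glyph24Alphabet.includes(glyph)) {
    throw new Error("not in the alphabet: " + glyph);
  }
  return NON_IPA_GLYPHS.has(glyph) ? NON_IPA_GLYPHS.get(glyph) : glyph;
}

console.log(new Set(Glyph24Alphabet).size); // 24
console.log(glyphToIpa("c"));               // ʃ
```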

*Phonotactics:*

So I've talked about phonotactics before, here and on the conlang
mailing list, and now Victor Chan has brought it up.

With his system, assuming a 16-glyph alphabet, there are 663 possible
syllables. If we assume affricates are okay, that brings it up to
852, or 1101 if /l/ can be second and final.

With a 24-glyph alphabet (which is about average for world languages),
we have 2825 syllables with only glides as seconds, 4440 syllables
if including affricates, or 5495 if including /l/ as second or last.
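For anyone who wants to check syllable counts like these against their own inventory, here is a minimal sketch that counts and enumerates syllables of the shape (C)(G)V(C). The inventory is a made-up toy example, so it is not an attempt to reproduce the figures quoted above, and it ignores refinements such as affricates or the restriction on nasals. The names `countSyllables`, `enumerateSyllables` and `toyInventory` are illustrative.

```javascript
// Count syllables of the shape (C)(G)V(C) as a simple product.
// Each optional slot contributes (size + 1) choices: any member, or nothing.
function countSyllables(inv) {
  return (inv.onsets.length + 1) * (inv.glides.length + 1) *
         inv.vowels.length * (inv.codas.length + 1);
}

// Enumerate the same syllables as strings, e.g. to build a list of free roots.
function* enumerateSyllables(inv) {
  for (const c of ["", ...inv.onsets]) {
    for (const g of ["", ...inv.glides]) {
      for (const v of inv.vowels) {
        for (const f of ["", ...inv.codas]) {
          yield c + g + v + f;
        }
      }
    }
  }
}

// Toy inventory (an assumption for illustration only).
const toyInventory = {
  onsets: ["m", "k", "p"],
  glides: ["y", "w"],
  vowels: ["i", "a", "u"],
  codas: ["k", "n"], // one voiceless obstruent and one nasal
};

console.log(countSyllables(toyInventory));                // 108
console.log([...enumerateSyllables(toyInventory)].length); // 108
```

Swapping in a real 16- or 24-glyph split only requires changing the four arrays; extra rules would be added as filters inside the generator.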

About 1000 words gives roughly 80% fluency and 3000 gives 90%;
around 8000 is average fluency, with good writers having around 15,000
and great writers as many as 30,000.

For an IAL there would certainly need to be room for as many words as
English has (a million plus), but the specialty words can be
more complex, either compounds or using a greater range of phonemes.

So I guess 852 for the core and 4440 for fluent vocab should be enough.
I know my wife has trouble pronouncing /tla/ or any /tl/ initial,
so I'm guessing she's not the only one. Chan mentioned this for
Chinese speakers, of whom there are more than a billion, so making a
language easy for them to learn is important.

*Source Languages*

For the purpose of my algorithms I can only use languages which are
included in Google Translate, though not all of them, as that would
unfairly bias it towards whichever families are best represented
there. Thus I've decided to go for several languages, each of which
represents a major language family or branch.

Chinese (Sino-Tibetan),
English (West Germanic),
Spanish (Romance),
Hindi (Indo-Aryan),
Indonesian (Austronesian),
Russian (Slavic),
Swahili (Niger-Congo),
Swedish (North Germanic),
Turkish (Turkic),
Finnish (Uralic),
Farsi (Indo-Iranian),
Greek (Hellenic).

If you know any other major language families which are available on
Google Translate but are not listed here, have alternative
translation engines, or have qualms with the ones I have listed, then
please comment.
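The language list can be kept as data so the later steps can loop over it. This is only a sketch: the `code` values are ISO 639-1 codes, which a particular translation service may or may not accept as-is, and the `sourceLanguages` and `familiesCovered` names are illustrative.

```javascript
// Source languages from the list above, with the family labels as given there.
const sourceLanguages = [
  { name: "Chinese", code: "zh", family: "Sino-Tibetan" },
  { name: "English", code: "en", family: "West Germanic" },
  { name: "Spanish", code: "es", family: "Romance" },
  { name: "Hindi", code: "hi", family: "Indo-Aryan" },
  { name: "Indonesian", code: "id", family: "Austronesian" },
  { name: "Russian", code: "ru", family: "Slavic" },
  { name: "Swahili", code: "sw", family: "Niger-Congo" },
  { name: "Swedish", code: "sv", family: "North Germanic" },
  { name: "Turkish", code: "tr", family: "Turkic" },
  { name: "Finnish", code: "fi", family: "Uralic" },
  { name: "Farsi", code: "fa", family: "Indo-Iranian" },
  { name: "Greek", code: "el", family: "Hellenic" },
];

// Check that no family label is counted twice, which would bias the tallies.
function familiesCovered(languages) {
  return new Set(languages.map((lang) => lang.family));
}

console.log(familiesCovered(sourceLanguages).size === sourceLanguages.length); // true
```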

*Word to Concept Matching*

Now this is actually one of the most complicated parts.
Basically I take the proto-language approach, which says that if a
bunch of languages have a word or phoneme in common, then it is a good
one to use for the proto-language, or in this case for the auxlang.
Thus words get the phonemes that are most commonly represented across
world languages.
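One way to read "phonemes in common" is to count, for one concept, how many source languages contain each phoneme. The sketch below does that, counting a phoneme once per language so that a long word does not dominate. The transcriptions are invented placeholders, not real translations, and `phonemeSupport` is an illustrative name.

```javascript
// For one concept: phoneme -> number of source languages whose word contains it.
// `transcriptions` maps a language label to an array of phonemes (glyphs).
function phonemeSupport(transcriptions) {
  const support = new Map();
  for (const phonemes of Object.values(transcriptions)) {
    // A Set counts each phoneme at most once per language.
    for (const p of new Set(phonemes)) {
      support.set(p, (support.get(p) || 0) + 1);
    }
  }
  // Most widely shared first; ties broken alphabetically for stable output.
  return [...support.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1)
  );
}

// Invented placeholder data (not real words from any language).
const toyTranscriptions = {
  lang1: ["m", "a", "t"],
  lang2: ["m", "a", "n"],
  lang3: ["p", "a", "t"],
};

console.log(phonemeSupport(toyTranscriptions));
// [ [ 'a', 3 ], [ 'm', 2 ], [ 't', 2 ], [ 'n', 1 ], [ 'p', 1 ] ]
```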

However, due to the limited syllable space, and because some phonemes
are simply more common than others, not all words can have their ideal
phoneme set. To decide which words "deserve" to get closer to the
ideal, we can use usage lists or word-frequency lists: if a word is
used more often, it deserves to be closer to its ideal phonemes.

Additionally, part of the algorithm can identify the best position for
each phoneme. For instance, if the phoneme is within the first few
letters of the source word, it is most likely to become the first or
second letter of the IAL root. Similarly, if it is near the end, it is
more likely to become the final consonant.
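The position idea can be sketched by sorting each phoneme of a source word into an initial, medial or final slot and tallying per slot. Splitting the word into thirds is my own assumption; "the first few letters" could be cut differently. The names `slotOf` and `tallyBySlot` and the sample data are illustrative only.

```javascript
// Classify a position within a word as "initial", "medial" or "final".
// Assumption: the first third counts as initial and the last third as final.
function slotOf(index, length) {
  const last = length - 1;
  if (3 * index <= last) return "initial";
  if (3 * index >= 2 * last) return "final";
  return "medial";
}

// Tally how often each phoneme turns up in each slot across the source words.
function tallyBySlot(transcriptions) {
  const tally = { initial: new Map(), medial: new Map(), final: new Map() };
  for (const phonemes of Object.values(transcriptions)) {
    phonemes.forEach((p, i) => {
      const slot = tally[slotOf(i, phonemes.length)];
      slot.set(p, (slot.get(p) || 0) + 1);
    });
  }
  return tally;
}

// Invented placeholder data (not real words from any language).
const slotExample = tallyBySlot({
  lang1: ["m", "a", "t"],
  lang2: ["m", "i", "t"],
  lang3: ["n", "a", "t"],
});

console.log(slotExample.initial.get("m")); // 2
console.log(slotExample.final.get("t"));   // 3
```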

If there aren't really any good options, and only relatively few
phonemes are available, then the goal would be to approximate a word
from one of the existing languages.

The algorithm would accept four inputs: the list of words to define,
the word-frequency list, the list of currently defined and possible
words, and the source translations with phonemic transcriptions. It
would order the words to define by frequency, with the most frequent
ones being defined first.
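The ordering step is small enough to sketch directly. Here the frequency list is assumed to be a `Map` from word to a usage count, and words missing from it are treated as frequency 0 so they are defined last. The counts are invented placeholders and `orderByFrequency` is an illustrative name.

```javascript
// Return the words to define, most frequent first.
// Words absent from the frequency list get 0 and therefore sort last.
function orderByFrequency(wordsToDefine, frequencyList) {
  const rank = (word) => frequencyList.get(word) || 0;
  // Copy before sorting so the caller's array is left untouched.
  return [...wordsToDefine].sort((a, b) => rank(b) - rank(a));
}

// Invented placeholder counts, not taken from a real corpus.
const toyFrequencies = new Map([["the", 1000], ["water", 50], ["tree", 20]]);

console.log(orderByFrequency(["tree", "zyzzyva", "the", "water"], toyFrequencies));
// [ 'the', 'water', 'tree', 'zyzzyva' ]
```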

It would output how many of each phoneme are found in the source
languages, and which phonemes are popular starters and popular finals.
Then it would look at the list of defined and possible words and see
which of the possible words match the beginning phoneme, the ending
phoneme, the central vowel and, if necessary, the secondary phoneme.
Based on what it finds, it would output the top 4 possibilities for
that word, or, if one is a good enough match, it may simply define the
word by itself.
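The matching step might look something like the sketch below: score each still-available root against the target beginning phoneme, vowel and ending phoneme, return the top 4, and accept the best one automatically if it scores high enough. The weights (1 per main slot, 0.5 for the secondary phoneme), the threshold of 3 and all names here are assumptions of mine, not something the description above fixes.

```javascript
const VOWELS = new Set(["i", "a", "u", "e", "o"]);

// Split a root of the shape (C)(G)V(C) into its slots; null if it has no vowel.
function parseRoot(root) {
  const glyphs = [...root];
  const v = glyphs.findIndex((g) => VOWELS.has(g));
  if (v === -1) return null;
  return {
    initial: v > 0 ? glyphs[0] : "",
    second: v > 1 ? glyphs[1] : "",
    vowel: glyphs[v],
    final: v < glyphs.length - 1 ? glyphs[glyphs.length - 1] : "",
  };
}

// Assumed weights: 1 each for initial, final and vowel, 0.5 for the second slot.
function scoreRoot(root, target) {
  const parts = parseRoot(root);
  if (!parts) return 0;
  let score = 0;
  if (parts.initial === target.initial) score += 1;
  if (parts.final === target.final) score += 1;
  if (parts.vowel === target.vowel) score += 1;
  if (parts.second === target.second) score += 0.5;
  return score;
}

// Rank the available roots; report the top 4 and an automatic pick, if any.
function suggestRoots(availableRoots, target, autoAcceptScore = 3) {
  const top = availableRoots
    .map((root) => ({ root, score: scoreRoot(root, target) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 4);
  const accepted =
    top.length > 0 && top[0].score >= autoAcceptScore ? top[0].root : null;
  return { top, accepted };
}

const suggestion = suggestRoots(
  ["kun", "pat", "mit", "man", "myat", "mat"],
  { initial: "m", second: "", vowel: "a", final: "t" }
);
console.log(suggestion.top.map((t) => t.root)); // [ 'mat', 'myat', 'pat', 'mit' ]
console.log(suggestion.accepted);               // mat
```

The target slots would come from the per-slot tallies of the source translations, and a root would be removed from the available list once it has been assigned.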

Having an algorithm that can do this is quite important in the real
world of IALs, since making vocabulary sucks up so much time. It took
me months to make the 1000 or so current words of Mwak/Lank, which I
think is really ridiculous, but in so doing I think I've worked out the
algorithm I used to make them, which is the above.

If you have more suggestions for it from your own experience with
worldlang creation, please share; we can all benefit :-).

Thanks,
Logan



On Sat, Jan 9, 2016 at 8:45 PM, Leo Moser <[email protected]> wrote:

    *From:* Victor Chan [mailto:[email protected]]
    *Sent:* Saturday, January 9, 2016 1:41 PM
    *To:* Auxlangs: International Auxilary Languages <[email protected]>
    *Subject:* [Auxlangs: International Auxilary Languages] Since there
    are a lack of resource for...

    Victor Chan posted in Auxlangs: International Auxilary Languages.
    January 9 at 1:40pm

    Since there are a lack of resource for phonotactics of worldlang. I
    am going to make a suggestion for phonotactics of worldlang. To
    avoid bias toward languages which gain their prestige from
    imperialism, I will use universal tendency to decide the maximal
    complexity of the phonotactics.

    The moderately complex syllable structure by Ian Maddieson from the
    WALS website will be used as a basis since it occur in nearly half
    of the observed languages. The interphonology articles implies that
    /s/ + C initial cluster and C + /s/ final clusters may actually be
    learnable enough to also be considered for worldlang but I will
    exclude them since violation of the sonority profile is rare
    cross-linguistically.

    Some recent ESL articles on native Chinese and native Indian
    speakers will be used to specify the phonotactics within the
    moderately complex range. /l/ is found to be difficult in onset
    cluster so the second element of the onset cluster should be
    restrict to glides.

    Using the combined data from multiple ESL articles, the learnability
    of final consonant will be provided in this decreasing order:
    voiceless obstruent, nasal, liquid, and voiced obstruent.

    The ESL articles provide conflicting result on only one final
    consonant which is /l/; the research on Indian ESL implies that it
    is easy to acquire but the research on Chinese ESL implies that it
    is one of the most difficult. Multilingualism in Indian can be a
    possible explanation for this conflicting result. Final nasal was
    also found to be difficult before some diphthongs.
    Using this analysis, the phonotactics will be (C) (G) V (C) where G
    represent glides and final C are restricted to voiceless obstruent
    and nasal where nasal can only occur before monophthong.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQIcBAEBAgAGBQJWpA68AAoJEIbY/Hz61ycpCLYP/Rbiwi55PTxi5Nflaoe3RjNq
DmDx+jlWa/t1I1KFskY2RLbVA4PlXRAfoNDsoCmXK//8GsLQU8LxLepnTUsrp17u
rGJQtKqA8zDBQzOr2uBeAQruh4gO58ZgjJRzM8Ma32HDVEZi6IaCL3yQvTItJ5zq
Rw59DTC3EOu6IjjYiDg0Rn6LSULJoyPEp9YPexdMNKyJtbyAYg2yQ6rpnArph5Mz
e/du/l5Gg7K/AIdhoC9xa/+hp/uJIdZlOdeo/TTLgWzOZLX3WxeujTw1F03AFwh0
0FNnNfU9ObJkKU73VCdNmqjSRZxlTHtFXJXN4vuHByHPT5vjERp2VpIFwzXlP14X
0OS8gnnAGsS9IMXSwBbOROl2pf2OEevNCK/P9ZE6nXm4qmEE9i4xQQKXbUaUsmbI
5yVdEuhjAipZRml+3S8paDCeBr4G+b9je2DlifaYjQty/6RU7bW7JL9KEAA5xGTu
Vuj9mbozLGjnsO/BsmNBsXhpNcg7+aZfdQxeokgfuN32kdvmEYXOGxV2EkC14XZo
on8Xo8bA1EQEfHXRlTlSopB5d8kEjubkazt5tZDhpKf7DPYXiex39o0k6RtBedq4
SI1BkjGO/J3hNkjc9NF9hueEfxu8NGuJ2YrwzjHzCtzP+67N7vIJUQJCBP1K+RG6
/uESdAWzuzmdA8ZC76et
=tbNy
-----END PGP SIGNATURE-----


-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
