Kevin,

> > However, affixing has 2 parts:
> > 1. create an affix file
> > 2. add the proper affixes to the individual words in the dictionary
> > file.
> >
> > I completely miss 2. in gramadoir.
>
> Yes, I have some of 2. in place for Gramadóir. And
> this part I admit to being completely undocumented!
>
> All that it amounts to at this stage is a simple-minded
> Perl script that takes as input an affix file
> and a large plain text corpus of text in the
> target language (from http://borel.slu.edu/crubadan/).
> For each flag in the affix file, it applies
> the rules under that flag *in reverse*
> (i.e. strips affixes) to all of
> the words it sees in the corpus and looks for
> common "root words".

Can it only strip? The condition and the modification (addition) are
important too. For example, alma is almák in the plural, so the reverse
rule is: strip "ák" and add "a" to get the original word.

> For example, imagine there are 6 rules under flag "A",
> and I find words like "grokker", "grokking", "grokked",
> "grokkish", "grokalicious" in the corpus
> such that 5 out of 6 of the flag A rules apply
> to give the root "grok". Then it might be safe to add
> "grok/A" to the word list.

Why would it be safe if the form produced by the sixth rule could not be
found in the corpus?

> These candidates can be ranked by percentage if you like;
> in any case it's usually a good idea to check the output
> manually. We've had some luck with this approach for Basque,
> which has rich morphology.
>
> It would be nice to generalize this approach
> to work with HunSpell if anyone's feeling
> up to the task; I imagine it could
> start to get computationally expensive
> for large multilevel affix files and large corpora.
>
> I'm guessing I'm not the first to have written
> something like this - in fact, maybe Laci et al
> already have something like this in HunSpell;
> I admit I haven't looked carefully yet.
>
> The other important question is automatically
> constructing the affix file itself from a plain text
> corpus. This is obviously much harder.
> Anyone interested in this question should have
> a look at John Goldsmith's Linguistica
> project at the Univ. of Chicago. I've played
> around with the demo and it looks promising.
> http://linguistica.uchicago.edu/

Example 3-1 in http://borel.slu.edu/gramadoir/manual/c409.html#POS says:

    dipper 31
    dire 36
    direct 33
    direct 36
    direct 37
    directed 36
    direction 31
    directional 36
    directions 32

What do 31, 32, etc. mean? Are they groups of flags in myspell? Where are
these flags documented? Where is their connection with the affixes
documented? I also cannot see the kind of word there (verb, noun, etc.).
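
Coming back to the affix question for a moment: to make clear what I mean
by "strip and add", here is a rough Python sketch of how I understand the
reverse application. It is only my reading, not the actual Perl script. The
names RULES, reverse_rule and candidate_roots are made up, each rule is
simplified to a (strip, add) pair with no condition field, and requiring the
root itself to appear in the corpus is my own extra assumption.

```python
from collections import defaultdict

# Hypothetical suffix rules for one flag, as (strip, add) pairs:
# going forward, remove `strip` from the end of the root and append `add`.
RULES = {
    "A": [("a", "ák"), ("a", "át"), ("a", "ában"), ("a", "ának")],
}

def reverse_rule(word, strip, add):
    """Undo one suffix rule: drop `add` from the word, put `strip` back."""
    if add and word.endswith(add) and len(word) > len(add):
        return word[: -len(add)] + strip
    return None

def candidate_roots(corpus_words, rules):
    """Map (root, flag) to the fraction of that flag's rules seen in the corpus."""
    words = set(corpus_words)
    hits = defaultdict(set)
    for flag, flag_rules in rules.items():
        for word in words:
            for index, (strip, add) in enumerate(flag_rules):
                root = reverse_rule(word, strip, add)
                # Assumption: only count roots that are attested themselves.
                if root is not None and root in words:
                    hits[(root, flag)].add(index)
    return {key: len(seen) / len(rules[key[1]]) for key, seen in hits.items()}

corpus = ["alma", "almák", "almát", "almában", "körte"]
scores = candidate_roots(corpus, RULES)
for (root, flag), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{root}/{flag}  {score:.2f}")  # prints: alma/A  0.75
```

Here 3 of the 4 rules are confirmed for "alma", so it scores 0.75. The
missing fourth form (almának) is exactly the case I am asking about: the
score shows that a form was not found, but not whether the flag is wrong
for that word.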

Finally: how are the grammatical errors formulated and entered into
gramadoir? For example, take this error in Hungarian:

    I see two boys

If we write "boys" (plural), it is an error, because when a number is
present the noun must be singular. In plain language: after a verb, if
there is a number or a word that expresses quantity (many, several,
some), the noun that follows must be singular. How can this be
formulated in gramadoir?

Thanks,
Eleonora
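
P.S. To make the grammar question concrete, here is a rough Python sketch
of the check I have in mind. It is not gramadoir's rule format, which is
what I am asking about. The tag names (V, NUM, QUANT, N_SG, N_PL) and the
function name plural_after_quantity are made up for illustration, and the
sketch only looks at a noun directly after the number or quantity word.

```python
# Hypothetical tag names; a real tagger would use its own tag set.
QUANTITY_TAGS = {"NUM", "QUANT"}

def plural_after_quantity(tagged):
    """Return indexes of plural nouns that directly follow a number or quantity word."""
    errors = []
    for i in range(1, len(tagged)):
        prev_tag = tagged[i - 1][1]
        word_tag = tagged[i][1]
        if prev_tag in QUANTITY_TAGS and word_tag == "N_PL":
            errors.append(i)
    return errors

# "I see two boys": the noun must stay singular after "két" (two).
wrong = [("Látok", "V"), ("két", "NUM"), ("fiúkat", "N_PL")]
right = [("Látok", "V"), ("két", "NUM"), ("fiút", "N_SG")]
print(plural_after_quantity(wrong))  # [2]
print(plural_after_quantity(right))  # []
```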
