Hi Mikel

On Saturday 12 November 2011 Mikel Forcada said:
> On 11/12/2011 10:31 AM, Kevin Donnelly wrote:
> > FWIW, I think the fundamental problem is that the format of the
> > dictionaries is non-optimal from a linguistic point of view.
>
> Kevin, it would be good to hear a bit more detail about how it would be
> improved, as at some point we should revive the process towards
> unification and standardization of metadix.
Caveat - I am not a CS specialist, and it's a couple of years since I worked directly on the Apertium format, so my memory may be hazy, or things may have changed. :-)  Also, sorry for the length ....

Current situation (as I see it)
------------------------------------------

I think there is a bundle of issues which combine to make things non-optimal:

-- the format requires words to be segmented;
-- the segmentation boundary doesn't necessarily align with morpheme boundaries;
-- many words are handled indirectly via paradigms.

The result is that expanding the dictionaries is actually quite involved - you examine the word-list, decide on paradigms and code them up, assign the words to paradigms, code up those that fit the paradigms, and code individually any words that do not fit into a paradigm. However, you can't necessarily use the paradigms you find in grammar-books, because the format uses orthographic rather than morphemic boundaries, so you may have to refactor the paradigms first, which is a non-trivial task in my experience. There is also the point that most speakers do not think in terms of paradigms anyway - they just "know" that a particular form "sounds" right - and working out the finer points of inflected tenses or locatives for rarely-used words is often not a trivial exercise either.

*Some* of this can be simplified via scripting - but then you need to be able to script, and in my experience the number of linguists (let alone interested members of the public) in any given location who can use regular expressions (never mind scripting!) can be counted on the fingers of half a hand. I also found updating the dictionaries even more difficult than creating them, but that's just a personal view based on my loathing of XML, and I accept that others probably find it simplicity itself. :-)

So, what to do? I would suggest a few things to make dictionary maintenance an order of magnitude easier:

(a) Remove paradigms from the dictionary.
-------------------------------------------------------------

In effect, you are splitting words artificially (not along linguistically-accepted lines) on input, so that you can put them back together again at lookup. It would be simpler just to enter and look up a full-form word. Paradigms serve no useful purpose here - they belong more to grammar (or at least morphology) than to lexicography. (It is true that a Latin dictionary will show mensa, -ae, as an aide-mémoire, but I think that if unlimited paper had been available in the old days, they would have written out each form in full, not mens- as a headword and then -a, -am, -arum, etc.) Paradigms may be quite useful in languages like Polish or Latvian, with a highly inflected system, but they are neither use nor ornament in analytic languages like English (with only rudimentary inflections left) or Chinese, or in agglutinative languages (Bantu or American languages). See (c) below for how to fill the paradigm-shaped gap.

(b) Make a grid the standard format for input.
-----------------------------------------------------------------

Most people are quite familiar with tables, and in fact a dictionary entry is a squished-up table. (And it cannot be a coincidence that all language dictionaries are presented in this format.) So if you tell helpers "this column is for the word, this is for the meaning, this is for the declension, this is for the gender", it can be easily grasped. At one stroke you have moved the work from "something that requires technical knowledge" to "something that I use every day".
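To make that concrete, a few rows of such a grid might look something like this (the columns and tags here are purely illustrative - a real template would carry whatever the language in question needs):

    surface     lemma     POS   gender  number  English
    casa        casa      n     f       sg      house
    casas       casa      n     f       pl      houses
    blanca      blanco    adj   f       sg      white
    cantamos    cantar    v     -       pl      we sing

Each row is one full surface form, which is what (a) above implies - no stems, no paradigm references, just a complete word and its properties.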
A grid can be accessed in a spreadsheet, a database, a table in a word-processor, or a text-editor (if you separate each column with a tab), so it requires no specialised software. It does not distract the user with node-names, or confound him with a missing bracket. (I know there are GUI interfaces, but (1) they need to be installed, and (2) in my experience they are slow to work with.) The benefit of a grid is that it varies only slightly between languages of completely different families (again, this cannot be a coincidence) - in other words, a basic template, extended as necessary, will go a long way (see the NoDaLiDa paper at http://siarad.org.uk/publications.php for Spanish/Welsh and English, and it works OK with Swahili too in the verb segmenter). The drudgery of adding words remains, though: I just added 1,500 new words to the Spanish and Welsh dictionaries for the autoglosser, and the average time to tidy an entry, check a printed dictionary, and add it was just over a minute per word (about 29 hours in all). However, updating a dictionary in a grid format is trivial.

(c) Instead of devising an interface to the current format, devise upstream tools for populating a grid format.
--------------------------------------------------

Paradigms (in this view) are gone from the dictionary, but they are still useful for some languages, where entering multiple cases for a noun (for example) is rather tedious. So produce tools to generate common (they don't have to be all-encompassing) forms based on a couple of column entries. For Latin you might have lexeme-root (mens), nominative singular (a), genitive singular (ae), declension (1), and then have a generator that uses those to fill in the other forms, all of which are added full-form to the dictionary grid - there is a rough sketch of this after (d) below. For a Bantu language you might have the lexeme (mti) and word-class (3), and generate the plural (4, miti). The benefit of this is that it's much easier for helpers to get started with a few manually-added "common" words in the grid, and then move progressively towards complete coverage (at the point where adding minimally-differentiated words becomes more tedious than trying to work out rules for recurrent changes). For example, depending on the source text, the subjunctive and past historic tenses in French may be relatively low priority. The generators may also be useful tools for other purposes apart from Apertium.

(d) Conversely, do trivial stemming as part of the lookup.
---------------------------------------------------------------------------------

Certain recurrent variations don't merit the name of paradigms, but may not need to be in the dictionary either. These could be handled by minimalist regexes (though HFST, recently mentioned on this list, might be a candidate for more heavyweight work). For instance, I think over 85% of the verb forms in the Apertium Spanish dictionary are forms with clitic pronouns, which really don't need to be there (so I've taken them out). Most English verb forms (walks, walked, walking) don't need to be in the English dictionary either (though I have more to do on that). In morphemically fairly regular languages like Spanish or Italian, a word ending in -a (e.g. a feminine adjective) or -ito (a diminutive) that does not appear in the dictionary can have the ending switched to -o to see if anything like that is in the dictionary, and so on. Again, the benefit of this is that it can be applied progressively as time or requirement permits - it's not something that has to be done all at once at the beginning.
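Going back to (c) for a moment, here is a very rough sketch (in Python, purely as an illustration - the column names are made up, the endings are the standard first-declension set minus the vocative, and nothing else is covered) of the kind of generator I have in mind:

    # Sketch of a form-generator: takes the root and declension from two
    # grid columns and emits full-form rows ready to paste back into the
    # grid.  Only the Latin first declension is treated here.

    FIRST_DECLENSION = {
        "nom.sg": "a",   "acc.sg": "am",  "gen.sg": "ae",
        "dat.sg": "ae",  "abl.sg": "a",
        "nom.pl": "ae",  "acc.pl": "as",  "gen.pl": "arum",
        "dat.pl": "is",  "abl.pl": "is",
    }

    def generate_forms(root, declension):
        """Return (surface, lemma, tags) triples for one noun."""
        if declension != 1:
            raise NotImplementedError("only the first declension is sketched here")
        lemma = root + FIRST_DECLENSION["nom.sg"]
        return [(root + ending, lemma, tags)
                for tags, ending in FIRST_DECLENSION.items()]

    # e.g. mens- (mensa, "table"), first declension
    for surface, lemma, tags in generate_forms("mens", 1):
        print(surface, lemma, "n", tags, sep="\t")

A similar couple of dozen lines would cover the Bantu noun-class example (mti/miti), and since the output is just more grid rows, nothing downstream needs to know a generator was ever involved.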
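And for (d), a minimal fallback lookup along those lines might look something like this (again just a sketch - the clitic list and the ending swaps are examples, not an exhaustive set):

    import re

    # Try the word as-is first, then strip a few recurrent Spanish variations.
    CLITICS = r"(?:me|te|se|lo|la|le|los|las|les|nos|os)"

    def lookup(word, dictionary):
        if word in dictionary:
            return dictionary[word]
        # verb form with one or two attached clitic pronouns, e.g. darse -> dar
        stripped = re.sub(CLITICS + r"{1,2}$", "", word)
        if stripped != word and stripped in dictionary:
            return dictionary[stripped]
        # feminine adjective or diminutive not listed: switch the ending to -o
        for ending in ("a", "ito"):
            if word.endswith(ending):
                candidate = word[:-len(ending)] + "o"
                if candidate in dictionary:
                    return dictionary[candidate]
        return None

    print(lookup("darse", {"dar": "give"}))       # give
    print(lookup("blanca", {"blanco": "white"}))  # white

The point is that each rule can be added (or removed) independently, so this sort of coverage can grow as the corpus demands, which fits the progressive approach above.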
(e) Develop a set of quickstart templates for particular language-types
-----------------------------------------------------------------------------------------------------

It would be worth stepping back to consider what pieces of information about particular languages need to be recorded in the dictionary, and why. For large swathes of languages, the required information will be almost identical, showing only minor differences (if any) between languages in the same family, and greater differences (though less extensive than might be expected) between language-groups. The idea would be to offer helpers a grid template that would be likely to suit their language, and let them start on that. Inevitably, some additions may be required, but these could be made organically, and fed back into the template resources. This would also be a good entrée towards trying to engage linguists as well as fellow CS/MT people - since Apertium is an RBMT rather than an SMT system, any input from them will be doubly effective.

For better or worse, that's my tupporth. :-)  I think Apertium is a tremendous resource, not least because of the collection of data that the project has amassed. With Google now beginning to charge for its translator, Apertium is probably best-placed to become THE open translator of choice, though of course there's a distance to go yet.

-- 
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk
