Dear Apertiumers,

As "principal instigator" and current PMC president of Apertium, I think my two cents' worth is expected here, so here I go. It will be a long two cents, so pour yourselves your favourite drink before reading it.

1. Apertium has a very, very flexible language to specify lexical transformations, such as the ones found in bilingual dictionaries. This allows for many different "coding styles". This freedom has, on the one hand, made Apertium a very successful project but, on the other hand, allowed for divergent styles of coding.

2. The existence of different styles of coding does not worry me "per se" (after all, this is a free/open-source project and therefore everyone's project), but I think it would be a very, very good idea for the PMC to do (or promote) some substantial work on public recommendations on how linguistic data should be built, as their absence and the existence of so many radically diverging "dix dialects" may effectively drive people away from adopting Apertium for "serious" work. In particular I worry about maintainability, as this is crucial for quality.

3. There is no normative decision as regards what information should go in a bilingual dictionary but, yes, there was a tradition. When Apertium started, it was used to translate between Romance languages, which meant that translations were basically word-for-word, and structural transfer did not cover all words. This was the reason to have bilingual dictionaries that only encoded a prefix of the lexical forms: the remaining part was simply copied or just slightly modified by transfer, as all morphological dictionaries were much alike. In most cases, we coded them as in a paper bilingual dictionary, but left out gender, for instance, when it did not change. This was inherited, in fact, from interNOSTRUM, and was not questioned as it was working reasonably well. But now Apertium covers many different languages, and morphological dictionaries are sometimes very different. Therefore, the question arises as to what to encode there. Different criteria may be used.
Francis Tyers seems to favour reusability (which is nice but, I agree with Felipe, secondary if it is not reusability inside Apertium), but I don't think this entails including in the dictionary complete lexical forms like the ones that started this thread (after all, not including them makes the dictionaries as compact as paper bilingual dictionaries, which do not contain everything). Another criterion is compactness: Héctor Alòs considers a "radical" prefix approach, where not even the part of speech would be featured. Another criterion is to encode what is more likely to be preserved by transfer, which is what speakers of both languages would put in a bilingual dictionary, as morphology would be automatically discounted in their minds. But, as I said above, we need to reflect on the interplay between these criteria and try to draft a recommendation.

4. One thing in favour of having more than the minimum information necessary is that excess information that is the same on both sides may easily be removed automatically for applications like the ones Felipe mentions.

5. I am not in favour of using <i> in bilingual dictionaries, if you want a coding recommendation from a pioneer. It is an early mistake (from the times of Spanish-Catalan and Spanish-Galician) that should be avoided.

6. We don't usually have paradigms in bilingual dictionaries, but I believe that we could avoid a great deal of "default" structural transfer by adding paradigms to bilingual dictionaries that would deal with the tags in "default" situations when morphological dictionaries are very different in their tagsets. Just an idea from your president.

7. Isn't this the kind of stuff that would have to be treated in an Apertium conference? Shouldn't people draft RFCs (requests for comments), and shouldn't all of us discuss them?

I hope you have arrived here and I look forward to your comments.
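
As an illustration of point 4 only, here is a minimal Python sketch of how tags that are identical on both sides of an entry might be trimmed automatically. Everything in it is an assumption made for the example: the function names are hypothetical and not part of any Apertium tool, lexical forms are written in a simplified `lemma<tag><tag>` text notation rather than real bilingual-dictionary XML, the entries are illustrative, and keeping the first tag (the part of speech) is just one possible policy among the criteria discussed in point 3.

```python
import re

# Matches one tag such as <n> or <f> in the simplified notation assumed here.
TAG = re.compile(r"<([^<>]+)>")


def split_lexical_form(form):
    """Split 'casa<n><f>' into ('casa', ['n', 'f'])."""
    first = form.find("<")
    if first == -1:
        return form, []
    return form[:first], TAG.findall(form[first:])


def join_lexical_form(lemma, tags):
    """Inverse of split_lexical_form."""
    return lemma + "".join("<" + tag + ">" for tag in tags)


def strip_shared_trailing_tags(left, right):
    """Drop trailing tags that are identical on both sides of an entry.

    The first tag (assumed to be the part of speech) is always kept.
    """
    left_lemma, left_tags = split_lexical_form(left)
    right_lemma, right_tags = split_lexical_form(right)
    while (len(left_tags) > 1 and len(right_tags) > 1
           and left_tags[-1] == right_tags[-1]):
        left_tags.pop()
        right_tags.pop()
    return (join_lexical_form(left_lemma, left_tags),
            join_lexical_form(right_lemma, right_tags))


if __name__ == "__main__":
    # Gender is the same on both sides, so it is trimmed.
    print(strip_shared_trailing_tags("casa<n><f>", "casa<n><f>"))
    # Gender differs, so the entry is left untouched.
    print(strip_shared_trailing_tags("señal<n><f>", "senyal<n><m>"))
```

Under these assumptions, a fully specified entry can be reduced to the compact "prefix" form, while entries whose trailing tags differ are kept whole. The reverse direction (adding the omitted tags back) would need the morphological dictionaries and is not attempted here.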

Mikel

On 06/28/2012 01:49 PM, Jimmy O'Regan wrote:
> On 28 June 2012 09:16, Francis Tyers <[email protected]> wrote:
>> On Thu, 28 June 2012 at 08:20 +0200, Felipe Sánchez Martínez
>> wrote:
>>> Hi all,
>>>
>>>> I usually encode all relevant data when building a bilingual dictionary,
>>>> to make later reuse easier. For example, in the Breton--French
>>>> dictionary, I put POS + gender on both sides even if the gender doesn't
>>>> change, because that way it makes reuse of the data easier (it means you
>>>> don't have to look up (possibly ambiguous) lemmas in the morphology.
>>> I do not see how not encoding the morphological information that does
>>> not change makes the data less reusable. All the "relevant" information
>>> is there and the "irrelevant" one can be easily added.
>> Not easily added by a non-expert user. If there were tools that would
>> automatically "introduce" this information it may be a different
>> matter.
>>
> For the non-Java-phobic, dixtools has a tool for this.
>

-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
