Dear Apertiumers,

as "principal instigator" and current PMC president of Apertium, I think 
my two cents worth is expected here, so here I go. It will be a long two 
cents, so pour yourselves your favourite drink before reading it.

1. Apertium has a very, very flexible language to specify lexical 
transformations, such as the ones found on bilingual dictionaries. This 
allows for many different "coding styles". This freedom has, on the one 
hand, made Apertium a very successful project, but, on the other hand, 
allowed for divergent styles of coding.

2. The existence of different styles of coding does not worry me "per 
se" (after all, this is a free/open-source project and therefore 
everyone's project) but I think it would be a very, very good idea for 
the PMC to do (or promote) some substantial work on public 
recommendations on how linguistic data should be built, as its absence 
and the existence of so many radically diverging "dix dialects" may 
effectively drive people away from adopting Apertium for "serious" work. 
In particular I worry about maintainability, as this is crucial for quality.

3. There is no normative decision as regards what information should go 
in a bilingual dictionary, but, yes, there was a tradition. When Apertium 
started, it was used to translate between Romance languages, which meant 
that translations were basically word-for-word, and structural transfer 
did not cover all words. This was the reason to have bilingual 
dictionaries that only encoded a prefix of the lexical forms: the 
remaining part was simply copied or just slightly modified by transfer, 
as all morphological dictionaries were much alike. In most cases, we 
coded them as in a paper bilingual dictionary, but left out gender, for 
instance, when it did not change. This was inherited, in fact, from 
interNOSTRUM, and was not questioned as it was working reasonably.  But 
now, Apertium covers many different languages and morphological 
dictionaries are sometimes very different. Therefore, the question 
arises as to what to encode there. Different criteria may be used. 
Francis Tyers seems to favour reusability (which is nice, but, I agree 
with Felipe, secondary if it is not reusability inside Apertium), but I 
don't think this entails including complete lexical forms like the ones 
that started this thread in the dictionary (after all, not including 
them makes the dictionaries as compact as paper bilingual dictionaries 
which do not contain everything). Another criterion is compactness. 
Héctor Alòs considers a "radical" prefix approach, where not even the 
part-of-speech would be featured. Another criterion is to encode what is 
more likely to be preserved by transfer, which is what speakers of both 
languages would put in a bilingual dictionary as morphology would be 
automatically discounted in their minds. But as I said above, we need to 
reflect on the interplay between these criteria and try to draft a 
recommendation.
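
To make the "prefix" tradition concrete, here is a minimal Python sketch. 
It is not how lttoolbox or the Apertium transfer modules are implemented; 
the BIDIX table, the translate() function and the tuple representation 
of lexical forms are all assumptions made for illustration, and the 
Spanish-Catalan entries are only examples.

```python
# Hypothetical toy bilingual dictionary. Keys and values are *prefixes* of
# lexical forms (lemma plus leading tags), not complete lexical forms.
BIDIX = {
    ("casa", "n"): ("casa", "n"),               # gender unchanged: left out
    ("señal", "n", "f"): ("senyal", "n", "m"),  # gender changes: encoded
}


def translate(lexical_form):
    """Translate a lexical form using the longest matching prefix.

    Whatever follows the matched prefix is copied to the target side
    unchanged, which is the behaviour the prefix tradition relies on.
    """
    for cut in range(len(lexical_form), 0, -1):
        prefix = tuple(lexical_form[:cut])
        if prefix in BIDIX:
            return BIDIX[prefix] + tuple(lexical_form[cut:])
    return None  # unknown word


print(translate(("casa", "n", "f", "pl")))
print(translate(("señal", "n", "f", "sg")))
```

In this sketch the entry for "casa" stops at the part of speech, so 
gender and number are simply copied, whereas the entry for "señal" has 
to go one tag further because the gender differs on the target side.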

4. One thing in favour of having more than the minimum information 
necessary is that excess information that is the same on both sides may 
easily be automatically removed for applications like the ones Felipe 
mentions.
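
As a rough illustration of how easy that removal could be, here is a 
minimal Python sketch. The function name strip_shared_tags, the 
(lemma, tags) representation and the min_tags parameter are my own 
assumptions, not part of any existing Apertium tool, and the 
Breton-French entries are only examples.

```python
def strip_shared_tags(left, right, min_tags=1):
    """Drop trailing tags that are identical on both sides of an entry.

    `left` and `right` are (lemma, tags) pairs. At least `min_tags` tags
    are kept on each side (by default the part of speech survives).
    """
    (l_lemma, l_tags), (r_lemma, r_tags) = left, right
    l_tags, r_tags = list(l_tags), list(r_tags)
    while (len(l_tags) > min_tags and len(r_tags) > min_tags
           and l_tags[-1] == r_tags[-1]):
        l_tags.pop()
        r_tags.pop()
    return (l_lemma, l_tags), (r_lemma, r_tags)


# Same gender on both sides: the gender tag is redundant and is removed.
print(strip_shared_tags(("levr", ["n", "m"]), ("livre", ["n", "m"])))
# Different gender: nothing can be removed.
print(strip_shared_tags(("ti", ["n", "m"]), ("maison", ["n", "f"])))
```

Calling it with min_tags=0 would give something close to the "radical" 
prefix approach, in which not even the part of speech is kept.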

5. I am not in favour of using <i> in bilingual dictionaries, if you 
want a coding recommendation from a pioneer. It is an early mistake 
(from the times of Spanish-Catalan and Spanish-Galician) that should be 
avoided.

6. We don't usually have paradigms in bilingual dictionaries but I 
believe that we could avoid a great deal of "default" structural transfer 
by adding paradigms to bilingual dictionaries that would deal with the 
tags in "default" situations when morphological dictionaries are very 
different in their tagsets. Just an idea from your president.
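
A minimal Python sketch of what I have in mind follows. Everything in 
it is hypothetical: the paradigm name "n_m__n", the tag names and the 
expand() function are invented for illustration and do not describe an 
existing Apertium feature or file format.

```python
# Hypothetical bilingual "paradigms": each one lists the default
# correspondence between source and target tag sequences.
PARADIGMS = {
    # source nouns carry gender, target nouns do not
    "n_m__n": [
        (("n", "m", "sg"), ("n", "sg")),
        (("n", "m", "pl"), ("n", "pl")),
    ],
}

# Each entry is (source lemma, target lemma, paradigm name).
ENTRIES = [("coche", "car", "n_m__n")]


def expand(entries, paradigms):
    """Expand lemma pairs plus paradigms into full lexical-form pairs."""
    table = {}
    for src_lemma, tgt_lemma, paradigm in entries:
        for src_tags, tgt_tags in paradigms[paradigm]:
            table[(src_lemma,) + src_tags] = (tgt_lemma,) + tgt_tags
    return table


table = expand(ENTRIES, PARADIGMS)
print(table[("coche", "n", "m", "pl")])
```

The tag differences are stated once, in the paradigm, instead of being 
handled by a "default" structural transfer rule for every such word.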

7. Isn't this the kind of stuff that would have to be treated in an 
Apertium conference? Shouldn't people draft RFCs (requests for 
comments), and shouldn't all of us discuss them?

I hope you have made it this far, and I look forward to your comments.

Mikel


On 06/28/2012 01:49 PM, Jimmy O'Regan wrote:
> On 28 June 2012 09:16, Francis Tyers <[email protected]> wrote:
>> On Thu, 28 June 2012 at 08:20 +0200, Felipe Sánchez Martínez
>> wrote:
>>> Hi all,
>>>
>>>> I usually encode all relevant data when building a bilingual dictionary,
>>>> to make later reuse easier. For example, in the Breton--French
>>>> dictionary, I put POS + gender on both sides even if the gender doesn't
>>>> change, because that way it makes reuse of the data easier (it means you
>>>> don't have to look up (possibly ambiguous) lemmas in the morphology).
>>> I do not see how not encoding the morphological information that does
>>> not change makes the data less reusable. All the "relevant" information
>>> is there and the "irrelevant" one can be easily added.
>> Not easily added by a non-expert user. If there were tools that would
>> automatically "introduce" this information it may be a different
>> matter.
>>
> For the non-Java-phobic, dixtools has a tool for this.
>


-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
