Dear Apertiumers,
as "principal instigator" and current president of Apertium, I think my two
cents' worth is expected here, so here I go. It will be a long two cents, so
pour yourselves your favourite drink before reading it.
1. Apertium has a very, very flexible language to specify lexical
transformations, such as the ones found in bilingual dictionaries. This
allows for many different "coding styles". This freedom has, on the one
hand, made Apertium a very successful project, but, on the other hand,
allowed for divergent styles of coding.
2. The existence of different styles of coding does not worry me "per se"
(after all, this is a free/open-source project and therefore everyone's
project) but I think it would be a very, very good idea for the PMC to do
(or promote) some substantial work on public recommendations on how
linguistic data should be built, as their absence and the existence of so
many radically diverging "dix dialects" may effectively drive people away
from adopting Apertium for "serious" work. In particular I worry about
maintainability, as this is crucial for quality.
3. There is no normative decision as regards what information should go in
a bilingual dictionary, but, yes, there was a tradition. When Apertium
started, it was used to translate between Romance languages, which meant
that translations were basically word-for-word, and structural transfer did
not cover all words. This was the reason to have bilingual dictionaries
that only encoded a prefix of the lexical forms: the remaining part was
simply copied or just slightly modified by transfer, as all morphological
dictionaries were much alike. In most cases, we coded them as in a paper
bilingual dictionary, but left out gender, for instance, when it did not
change. This was inherited, in fact, from interNOSTRUM, and was not
questioned as it was working reasonably well. But now, Apertium covers many
different languages and morphological dictionaries are sometimes very
different. Therefore, the question arises as to what to encode there.
Different criteria may be used. Francis Tyers seems to favour reusability
(which is nice, but, I agree with Felipe, secondary if it is not
reusability inside Apertium), but I don't think this entails including
complete lexical forms like the ones that started this thread in the
dictionary (after all, not including them makes the dictionaries as compact
as paper bilingual dictionaries which do not contain everything). Another
criterion is compactness. Héctor Alòs considers a "radical" prefix
approach, where not even the part-of-speech would be featured. Another
criterion is to encode what is more likely to be preserved by transfer,
which is what speakers of both languages would put in a bilingual
dictionary, since morphology would be automatically discounted in their minds.
But as I said above, we need to reflect on the interplay between these
criteria and try to draft a recommendation.
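
To make the "prefix" tradition in point 3 concrete, here is a toy sketch in
Python. It is not Apertium's actual lookup code: the BIDIX table, the
translate function and the two Spanish-Catalan entries are invented for
illustration only. It just shows the idea that the dictionary matches the
longest encoded prefix of a lexical form and the remaining tags are copied.

```python
# Toy bilingual dictionary (hypothetical data, for illustration only):
# source (lemma, tag prefix) -> target (lemma, tags).
BIDIX = {
    ("casa", ("n",)): ("casa", ("n",)),             # gender unchanged, left out
    ("señal", ("n", "f")): ("senyal", ("n", "m")),  # gender changes, so encoded
}


def translate(lemma, tags):
    """Match the longest encoded prefix of the lexical form, copy the rest."""
    tags = tuple(tags)
    # Try the full tag sequence first, then shorter and shorter prefixes.
    for cut in range(len(tags), -1, -1):
        hit = BIDIX.get((lemma, tags[:cut]))
        if hit is not None:
            target_lemma, target_tags = hit
            # Tags beyond the matched prefix are carried over untouched.
            return target_lemma, target_tags + tags[cut:]
    return None  # unknown word


print(translate("casa", ["n", "f", "pl"]))
print(translate("señal", ["n", "f", "sg"]))
```

With "casa" only the part of speech is matched and gender and number are
copied; with "señal" the gender is part of the entry, so it is replaced
and only the number is copied.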
4. One thing in favour of having more than the minimum information
necessary is that excess information that is the same on both sides may
easily be automatically removed for applications like the ones Felipe
mentions.
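
As a rough illustration of the automatic removal mentioned in point 4, the
following Python sketch drops the trailing tags that are identical on both
sides of a single entry. It assumes the usual <e><p><l>...</l><r>...</r></p></e>
layout with <s n="..."/> symbols as direct children of <l> and <r>; the
function names and the "keep" parameter are my own invention, not part of
any Apertium tool.

```python
import xml.etree.ElementTree as ET


def tags(side):
    """Names of the <s n="..."/> symbols directly under <l> or <r>."""
    return [s.get("n") for s in side.findall("s")]


def strip_shared_trailing_tags(entry_xml, keep=1):
    """Remove trailing tags that are the same on both sides of one entry.

    keep (hypothetical parameter): minimum number of leading tags to leave
    on each side, e.g. 1 to always keep the part of speech.
    """
    entry = ET.fromstring(entry_xml)
    left, right = entry.find("p/l"), entry.find("p/r")
    if left is None or right is None:
        return entry_xml  # not a plain <p><l/><r/></p> entry: leave as is
    lt, rt = tags(left), tags(right)
    shared = 0
    # Count how many tags, from the end, are equal on both sides.
    while (shared < len(lt) - keep and shared < len(rt) - keep
           and lt[-1 - shared] == rt[-1 - shared]):
        shared += 1
    for side in (left, right):
        symbols = side.findall("s")
        for symbol in symbols[len(symbols) - shared:]:
            side.remove(symbol)
    return ET.tostring(entry, encoding="unicode")


full = ('<e><p><l>casa<s n="n"/><s n="f"/></l>'
        '<r>casa<s n="n"/><s n="f"/></r></p></e>')
print(strip_shared_trailing_tags(full))
```

An entry whose gender differs between the two sides is left untouched,
since the last tags are not equal.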
5. I am not in favour of using <i> in bilingual dictionaries, if you want a
coding recommendation from a pioneer. It is an early mistake (from the
times of Spanish-Catalan and Spanish-Galician) that should be avoided.
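
For anyone who wants to get rid of <i> in an existing bilingual dictionary,
a minimal Python sketch of the rewrite could look like this. It relies only
on the fact that <i>X</i> stands for the same content on both sides, i.e.
<p><l>X</l><r>X</r></p>; the function name is hypothetical and the sketch
handles one <e> entry at a time, not a whole .dix file.

```python
import copy
import xml.etree.ElementTree as ET


def expand_identity(entry_xml):
    """Rewrite every <i>X</i> inside one <e> as <p><l>X</l><r>X</r></p>."""
    entry = ET.fromstring(entry_xml)
    for index, child in enumerate(list(entry)):
        if child.tag != "i":
            continue
        pair = ET.Element("p")
        for side_name in ("l", "r"):
            side = copy.deepcopy(child)  # same text and <s> symbols
            side.tag = side_name
            side.tail = None
            pair.append(side)
        pair.tail = child.tail
        entry[index] = pair  # replace <i> in place, keeping its position
    return ET.tostring(entry, encoding="unicode")


print(expand_identity('<e><i>casa<s n="n"/></i></e>'))
```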
6. We don't usually have paradigms in bilingual dictionaries, but I believe
that we could avoid a great deal of "default" structural transfer by adding
paradigms to bilingual dictionaries that would deal with the tags in
"default" situations when morphological dictionaries are very different in
their tagsets. Just an idea from your president.
7. Isn't this the kind of stuff that should be discussed at an
Apertium conference? Shouldn't people draft RFCs (requests for comments),
and shouldn't all of us discuss them?
I hope you have made it this far, and I look forward to your comments.
Mikel
2012/6/28 Jimmy O'Regan <[email protected]>
> On 28 June 2012 09:16, Francis Tyers <[email protected]> wrote:
> > El dj 28 de 06 de 2012 a les 08:20 +0200, en/na Felipe Sánchez Martínez
> > va escriure:
> >> Hi all,
> >>
> >> > I usually encode all relevant data when building a bilingual dictionary,
> >> > to make later reuse easier. For example, in the Breton--French
> >> > dictionary, I put POS + gender on both sides even if the gender doesn't
> >> > change, because that way it makes reuse of the data easier (it means you
> >> > don't have to look up (possibly ambiguous) lemmas in the morphology.
> >>
> >> I do not see how not encoding the morphological information that does
> >> not change makes the data less reusable. All the "relevant" information
> >> is there and the "irrelevant" one can be easily added.
> >
> > Not easily added by a non-expert user. If there were tools that would
> > automatically "introduce" this information it may be a different
> > matter.
> >
>
> For the non-Java-phobic, dixtools has a tool for this.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
--
Mikel L. Forcada E-mail: [email protected]
Departament de Llenguatges Phone: +34-96-590-9776
i Sistemes Informàtics also +34-96-590-3772.
UNIVERSITAT D'ALACANT Fax: +34-96-590-9326, -3464
E-03071 ALACANT, Spain.
URL: http://www.dlsi.ua.es/~mlf