On Thu, 28 Jun 2012 at 08:20 +0200, Felipe Sánchez Martínez wrote:
> Hi all,
> 
> > I usually encode all relevant data when building a bilingual dictionary,
> > to make later reuse easier. For example, in the Breton--French
> > dictionary, I put POS + gender on both sides even if the gender doesn't
> > change, because that way it makes reuse of the data easier (it means you
> > don't have to look up (possibly ambiguous) lemmas in the morphology).
> 
> I do not see how not encoding the morphological information that does 
> not change makes the data less reusable. All the "relevant" information 
> is there and the "irrelevant" one can be easily added.

Not easily added by a non-expert user. If there were tools that
automatically "introduced" this information, it might be a different
matter.

But if you want to go from a bilingual dictionary to some kind of
CSV/text-based dictionary, with all the grammatical information, then
including it is useful.

$ lt-expand apertium-br-fr.br-fr.dix | grep -v -e ':>:' -e ':<:' -e 'REGEX'

charretour<n><m>:charretier<n><m>
charretour<n><f>:charretier<n><f>
chase<n><m>:chasse<n><f>
chasgeu<n><m>:suisse<n><mf>
chastre<n><m>:désagrément<n><m>
chastre<n><m>:soin<n><m>
 ...

This already gives a useful bilingual dictionary for other purposes. If
the gender information were not there it would be less useful, and
adding the gender information would require substantially more
programming than one line of 'grep'.
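
To illustrate the point, here is a minimal sketch of turning output like
the above into CSV. It assumes each line has the form
lemma<tag>...:lemma<tag>... as in the sample, and that no lemma contains
':' or '<'. The function names (split_side, parse_line, to_csv) are made
up for this example and are not part of any Apertium tool.

```python
import csv
import io
import re

# Matches one tag such as <n> or <mf> and captures its name.
TAG = re.compile(r"<([^<>]+)>")


def split_side(side):
    """Split 'chase<n><m>' into ('chase', ['n', 'm'])."""
    lemma = side.split("<", 1)[0]
    return lemma, TAG.findall(side)


def parse_line(line):
    """Return (left lemma, left tags, right lemma, right tags) or None.

    Lines without a ':' separator (blank lines, the ' ...' filler) are
    skipped. Assumption: lemmas themselves never contain ':'.
    """
    line = line.strip()
    if ":" not in line:
        return None
    left, right = line.split(":", 1)
    l_lemma, l_tags = split_side(left)
    r_lemma, r_tags = split_side(right)
    return (l_lemma, ".".join(l_tags), r_lemma, ".".join(r_tags))


def to_csv(lines, out):
    """Write one CSV row per parseable dictionary line."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)


sample = [
    "charretour<n><m>:charretier<n><m>",
    "chase<n><m>:chasse<n><f>",
    " ...",
]
buf = io.StringIO()
to_csv(sample, buf)
print(buf.getvalue(), end="")
```

Because the gender tags are present on both sides, the script never has
to consult the monolingual morphology; without them, the same job would
need a lookup step for every lemma.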

> > In any case I think it is probably not a good idea to assume that the
> > bilingual dictionary only encodes "different" information. If there is
> > another way to find it out, it would be better.
> 
> Not encoding the morphological information that does not change makes it 
> possible to automatically infer structural transfer rules with 
> apertium-transfer-tools. This tool has been around for more than 4 years.
> 
> I think it is not a good idea to change the way we do things. When we 
> designed Apertium we took the decision of not encoding the morphological 
> information that does not change in the bilingual dictionary and I think 
> that we should stand by what we decided at that moment if there is not a 
> "good" reason for the change and "reusability" is not one (see above).

I've been doing it this way for as long as I can remember.

Also, what you really mean is the "information apart from part of
speech that does not change". Otherwise we should have entries like:

  <e><p><l>coche</l><r>cotxe</r></p></e>

Anyway, I'm sure we can come up with some solution, perhaps a prefix
list of parts-of-speech, and then compare the remaining tags to see if
they are equivalent on both sides.
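
A rough sketch of that idea, only to make it concrete: strip a leading
run of part-of-speech tags from each side, and if what remains is
identical on both sides, drop it. The POS_TAGS set below is an assumed,
incomplete list, and the function names are hypothetical; this is not
how apertium-transfer-tools actually does it.

```python
import re

# Matches one tag such as <n> or <mf> and captures its name.
TAG = re.compile(r"<([^<>]+)>")

# Assumed, incomplete list of part-of-speech tags (illustration only).
POS_TAGS = {"n", "np", "vblex", "adj", "adv", "pr", "det"}


def split_side(side):
    """Split 'chase<n><m>' into ('chase', ['n', 'm'])."""
    lemma = side.split("<", 1)[0]
    return lemma, TAG.findall(side)


def split_pos(tags):
    """Split a tag list into (leading POS tags, remaining tags)."""
    i = 0
    while i < len(tags) and tags[i] in POS_TAGS:
        i += 1
    return tags[:i], tags[i:]


def join_side(lemma, tags):
    """Rebuild 'lemma<tag1><tag2>' from its parts."""
    return lemma + "".join("<%s>" % t for t in tags)


def minimise(left, right):
    """Drop the non-POS tags when they are the same on both sides."""
    l_lemma, l_tags = split_side(left)
    r_lemma, r_tags = split_side(right)
    l_pos, l_rest = split_pos(l_tags)
    r_pos, r_rest = split_pos(r_tags)
    if l_rest == r_rest:
        # Nothing changes besides (possibly) the POS: keep only the POS.
        l_tags, r_tags = l_pos, r_pos
    return join_side(l_lemma, l_tags), join_side(r_lemma, r_tags)


print(minimise("charretour<n><m>", "charretier<n><m>"))
print(minimise("chase<n><m>", "chasse<n><f>"))
```

With something like this, a fully tagged dictionary could be reduced to
the "only what changes" form on demand, whereas going the other way
needs the morphology.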

But in any case, if you need to test apertium-transfer-tools, then there
are pairs which I think follow the old standard: es-ca, es-pt etc.

Newer ones (anything with 3+ transfer [not relevant anyway], af-nl,
sv-da, mk-bg, nn-nb) do not.

Regards,

Fran


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
