On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
Here come several practical examples. I tried to select them for their
variety. The result is more a wish list than something structured.

These really are great! Thanks :) Sorry the reply has taken so long.

Let's begin with "je la baise". Depending on the context this may be
"I kiss her" or "I fuck her". The context can tell us if we are in a
formal or colloquial type of language. Another issue is that in this
case the anaphora resolution can also help us: if the pronoun
reference is "hand", it can only be "kiss"; if it is a person, the
doubt persists.

For this, I would like to look at a concordance* of a large number of
examples to see what kind of information can be used to disambiguate.

Intuitively it seems like knowing the genre (e.g. formal/informal) would
help. But probably also statistics about subjects, objects and adjuncts,
and what they (co-)refer with.

* I tried to search on DuckDuckGo, but in the "internet" domain it
is very hard to find examples with "kiss", even with "moderated search"
turned on.

In fact, perhaps that could be a genre "safe translation"... :D

Incidentally Google gives "I fuck her" as the translation. I'm able to get
"kiss" by adding "bouche" or "main".

I think if we want to go by frequency we should have "fuck"; if we go
by safety we should have "kiss".

Probably "humblement" or "vous" are also good indicators of the "kiss" meaning.

Any better than that would require further investigation with a concordance.
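
As a rough idea of what I mean by a concordance, here is a minimal
keyword-in-context sketch in Python. The function name, the window width
and the two sample sentences are made up for illustration; a real run
would read a large corpus instead.

```python
import re

def concordance(lines, keyword, width=30):
    """Return (left, hit, right) rows for every whole-word hit of keyword."""
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    rows = []
    for line in lines:
        for m in pattern.finditer(line):
            # Take up to `width` characters of context on each side.
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            rows.append((left.rjust(width), m.group(0), right.ljust(width)))
    return rows

# Invented sample sentences, only to show the output shape.
sample = [
    "Je la baise humblement, madame.",
    "Il prend sa main et il la baise.",
]

for left, hit, right in concordance(sample, "baise"):
    print(left, "[" + hit + "]", right)
```

Sorting the rows by the left or right context would then make recurring
cues (e.g. "main", "humblement") easier to spot.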

In terms of the module, if we want to do informal/formal then my previous
suggestion would work fine.

Another kind of problem is the Arpitan words "chamô" ("camel"; plural
"chamôs") and "chamôs" ("chamois"; unchanged in plural). So,
translating into French, I got yesterday chamois in a Bible text of
Exodus xD  I solved it deciding in a CG rule that all "chamôs"
(without anything around in the singular) are camels.

As this is a different morphological paradigm, I would go with the superscript
notation ¹²³...

(Similar cases in
French: fil/fils, foi/fois, cour/cours)

These have different lemmas, e.g.

^fils/fil<n><m><pl>/fils<n><m><sp>$     threads / son*
^fois/foi<n><f><pl>/fois<n><f><sp>$     faiths / time*
^cours/cour<n><f><pl>/cours<n><m><sp>$  courts / course*

The 'cour/cours' example can potentially be disambiguated by the gender.
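
A small sketch of that idea in Python, assuming the lexical unit is in
the stream format shown above and that the gender has already been
obtained from somewhere else (e.g. an agreeing determiner). The helper
names are hypothetical, and escaped characters in the stream are not
handled.

```python
import re

def parse_lu(lu):
    """Split '^surface/reading1/reading2$' into (surface, [(lemma, tags), ...])."""
    surface, *readings = lu.strip("^$").split("/")
    parsed = []
    for reading in readings:
        lemma = reading.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", reading)
        parsed.append((lemma, tags))
    return surface, parsed

def filter_by_gender(lu, gender):
    """Keep readings carrying the given gender tag; keep all if none match."""
    _, readings = parse_lu(lu)
    kept = [(lemma, tags) for lemma, tags in readings if gender in tags]
    return kept or readings

print(filter_by_gender("^cours/cour<n><f><pl>/cours<n><m><sp>$", "m"))
print(filter_by_gender("^fils/fil<n><m><pl>/fils<n><m><sp>$", "m"))
```

For 'cours' a single reading survives; for 'fils' both readings are
masculine, so the gender does not help there.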

For the others I suppose rules could be written, but I suspect they would be
quite brittle. My guess is that the <sp> ones are more frequent. So those
should be default, then the question is finding specific contexts where
it should be the others. A concordance would help, but I'm not sure how
they would be split by genre or semantic field. This is really a problem
with how world-knowledge is encoded.

I wonder if something could be done with word embeddings here. For example
my guess is that in the target language the two variants should not
be close in the vector space. And they should be closer to words in the
same semantic field. This could then be something like a
reweighting of the translations according to target language semantic
coherence.
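
To make the reweighting idea a bit more concrete, here is a minimal
sketch, assuming we already have target-language word vectors. The
function names and the toy two-dimensional vectors are invented for
illustration; this is not an existing Apertium module.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def reweight(candidates, context, emb):
    """Score each candidate translation by its mean similarity to the
    other target-language words in the sentence."""
    scores = {}
    for cand in candidates:
        sims = [cosine(emb[cand], emb[w]) for w in context
                if cand in emb and w in emb]
        scores[cand] = sum(sims) / len(sims) if sims else 0.0
    return scores

# Toy vectors, made up so that "threads" sits near "needle".
emb = {
    "threads": [1.0, 0.0],
    "son": [0.0, 1.0],
    "needle": [0.9, 0.1],
    "father": [0.1, 0.9],
}

print(reweight(["threads", "son"], ["needle"], emb))
```

The scores could then be combined with whatever weights the existing
lexical selection already assigns, rather than replacing them.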

Note that it would require information to be "backpropagated" from the target
language to the source language. Perhaps you could have something like
per-reading embeddings that are trained using target language information,

so e.g. (fils, fil<n><m><pl>) [0.323, 0.423, 0.11, 0.595]
        (fils, fils<n><m><sp>) [0.53, 0.605, 0.54, 0.639]

Felipe did something like this in his thesis, but he only looked at
sequences of part of speech tags. Here we need to know information about
the actual analyses.

In French there are plenty of words with different meanings, depending
on the gender: livre, page, tour, etc. The problem is that often the
immediate surrounding context does not disambiguate: des livres, les
pages, de tour, etc.

This sounds like it would work with some kind of longer distance,
bag-of-wordsy context module.
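
Something along these lines, as a very rough sketch: treat the whole
sentence as a bag of words and count overlaps with per-sense cue words.
The cue lists, the reading strings and the function name are all
hypothetical; in practice the cues would have to be learnt from a corpus.

```python
def pick_sense(tokens, senses, default):
    """Return the sense whose cue words overlap most with the sentence;
    fall back to `default` when no cue word is present."""
    bag = {t.lower() for t in tokens}
    best, best_score = default, 0
    for sense, cues in senses.items():
        score = len(bag & cues)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hand-written cue words, for illustration only.
livre_senses = {
    "livre<n><m>": {"lire", "auteur", "bibliothèque"},
    "livre<n><f>": {"sterling", "kilo", "peser"},
}

sentence = "il a acheté des livres à la bibliothèque".split()
print(pick_sense(sentence, livre_senses, default="livre<n><m>"))
```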

A similar but slightly different case is the word
pairs homicide mf/homicide m, féminicide mf/féminicide m, parricide
mf/parricide m, etc.: the one with the gender "mf" is a person and the
other is the action.

This looks like the fil/fils problem.

Other problems come in lexical selection. For instance, as a rule,
Catalan preposition "de" is translated as "de" in French, but if the
following word is a material, "en" must be selected (de fusta > en
bois). So in the Catalan2French lrx file we have a list of materials,
as we have a list of countries, a list of musical instruments, a list
of animals, etc. I dream about a monolingual dictionary where we could
get this kind of information. It is not useful to have these lists for
many language pairs using Catalan. This information should be in
apertium-cat and not in every apertium-cat-xxx lrx file.

Yes, this kind of information should certainly go in the monolingual
packages. A quick and simple approach would be to make a .metadix style
format where this information is stored around entries and generated in
.lrx style lists by an xslt script.

<e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
<e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>

->

    <def-seq n="materia">
       <match lemma="fusta"/>
       <match lemma="ferro"/>
    </def-seq>

These could then be included into the .lrx file by the Makefile, or
a separate, monolingual, file could be another argument to lrx-comp.
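
For illustration, the generation step could look roughly like this. The
sketch uses Python rather than XSLT, just to show the transformation. The
`list` attribute is the one proposed above; the `<section>` wrapper, the
"gat" entry and the function name are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Sample input: two entries from above plus one hypothetical unlisted entry.
METADIX = """<section>
<e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
<e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
<e lm="gat"><i>gat</i><par n="gat__n"/></e>
</section>"""

def def_seqs(xml_text):
    """Group entry lemmas by their 'list' attribute and emit one
    <def-seq> element (as a string) per list name."""
    root = ET.fromstring(xml_text)
    lists = {}
    for entry in root.iter("e"):
        name, lemma = entry.get("list"), entry.get("lm")
        if name and lemma:
            lists.setdefault(name, []).append(lemma)
    out = []
    for name, lemmas in lists.items():
        seq = ET.Element("def-seq", n=name)
        for lemma in lemmas:
            ET.SubElement(seq, "match", lemma=lemma)
        out.append(ET.tostring(seq, encoding="unicode"))
    return out

for block in def_seqs(METADIX):
    print(block)
```

Entries without a `list` attribute are simply skipped, so the same
dictionary file could carry both kinds of entries.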

Moreover, if we had words not only with different kinds of semantic
labels, but also marked as synonyms, maybe it'd be possible to give a
translation using a word labeled as synonym (if it has a translation)
instead of "unknown".

Not sure about this one, some concrete examples would help.

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff