Message from Francis Tyers <fty...@prompsit.com> on Wed, 17 June
2020 at 21:12:

> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> > Here come several practical examples. I tried to select them for their
> > variety. The result is more a wish list than something structured.
>
> These really are great! Thanks :) Sorry the reply has taken so long.
>
> > Let's begin with "je la baise". Depending on the context this may be
> > "I kiss her" or "I fuck her". The context can tell us if we are in a
> > formal or colloquial type of language. Another issue is that in this
> > case the anaphora resolution can also help us: if the pronoun
> > reference is "hand", it can only be "kiss"; if it is a person, the
> > doubt persists.
>
> For this, I would like to look at a concordance* of a large number of
> examples to see what kind of information can be used to disambiguate.
>
> Intuitively it seems like knowing the genre (e.g. formal/informal) would
> help. But probably also statistics about subjects, objects and adjuncts,
> and what they (co-)refer with.
>
> * I tried to search on DuckDuckGo, but in the "internet" domain it
> is very hard to find examples with "kiss", even with "moderated search"
> turned on.
>
> In fact, perhaps that could be a genre "safe translation"... :D
>
> Incidentally, Google gives "I fuck her" as the translation. I'm able to
> get "kiss" by adding "bouche" or "main".
>
> I think if we want to go by frequency we should have "fuck"; if we go
> by safety we should have "kiss".
>
> Probably "humblement" or "vous" are also good indicators of the "kiss"
> meaning.
>
> Any better than that would require further investigation with a
> concordance.
>
> In terms of the module, if we want to do informal/formal then my
> previous suggestion would work fine.
>
> > Another kind of problem is the Arpitan words "chamô" ("camel"; plural
> > "camels") and "chamôs" ("chamois"; unchanged in plural). So,
> > translating into French, yesterday I got "chamois" in a Bible text of
> > Exodus xD  I solved it by deciding in a CG rule that all "chamôs"
> > (without anything around in singular) are camels.
>
> As this is a different morphological paradigm, I would go with the
> superscript notation ¹²³...
>
> > (Similar cases in
> > French: fil/fils, foi/fois, cour/cours)
>
> These have different lemmas, e.g.
>
> ^fils/fil<n><m><pl>/fils<n><m><sp>$     threads / son*
> ^fois/foi<n><f><pl>/fois<n><f><sp>$     faiths / time*
> ^cours/cour<n><f><pl>/cours<n><m><sp>$  courts / course*
>
> The 'cour/cours' example can potentially be disambiguated by the gender.
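
To make the gender idea concrete, here is a minimal Python sketch that parses a lexical unit written as in the examples above and keeps only the readings with a given gender tag. The function names are hypothetical, the parser ignores escaping and other details of the real stream format, and where the expected gender would come from (for example an agreeing determiner) is left open.

```python
import re

def parse_lexical_unit(unit):
    """Split '^surface/reading1/reading2$' into (surface, [(lemma, tags), ...]).

    Simplified: assumes no escaped characters inside the unit.
    """
    surface, *readings = unit.strip().lstrip("^").rstrip("$").split("/")
    parsed = []
    for reading in readings:
        lemma = reading.split("<", 1)[0]          # text before the first tag
        tags = re.findall(r"<([^>]+)>", reading)  # e.g. ['n', 'f', 'pl']
        parsed.append((lemma, tags))
    return surface, parsed

def filter_by_gender(readings, gender):
    """Keep readings carrying the gender tag; fall back to all if none match."""
    kept = [r for r in readings if gender in r[1]]
    return kept or readings

surface, readings = parse_lexical_unit("^cours/cour<n><f><pl>/cours<n><m><sp>$")
masculine = filter_by_gender(readings, "m")
```

For fils and fois both readings share the same gender, so this filter leaves the ambiguity untouched, which matches the point that those cases need something else.
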
>
> For the others, I suppose rules could be written, but I suspect they
> would be quite brittle. My guess is that the <sp> ones are more frequent,
> so those should be the default; the question is then finding specific
> contexts where it should be the others. A concordance would help, but I'm
> not sure how they would be split by genre or semantic field. This is
> really a problem with how world-knowledge is encoded.
>
> I wonder if something could be done with word embeddings here. For
> example, my guess is that in the target language the two variants should
> not be close in the vector space, and they should be closer to words in
> the same semantic field. This could then be something like a
> reweighting of the translations according to target language semantic
> coherence.
>
> Note that it would require information to be "backpropagated" from the
> target language to the source language. Perhaps you could have something
> like per-reading embeddings that are trained using target language
> information, so e.g.
>
>     (fils, fil<n><m><pl>)  [0.323, 0.423, 0.11, 0.595]
>     (fils, fils<n><m><sp>) [0.53, 0.605, 0.54, 0.639]
>
> Felipe did something like this in his thesis, but he only looked at
> sequences of part of speech tags. Here we need to know information about
> the actual analyses.
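
A rough Python sketch of the reweighting idea, under strong assumptions: the per-reading vectors are the illustrative numbers from the example above, the context vector is invented, and nothing here says how such embeddings would actually be trained. It only shows the ranking step, using cosine similarity between each reading's vector and a vector summarising the surrounding context.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_readings(reading_vectors, context_vector):
    """Return readings ordered from most to least similar to the context."""
    scored = [(cosine(vec, context_vector), reading)
              for reading, vec in reading_vectors.items()]
    return [reading for _, reading in sorted(scored, reverse=True)]

# Illustrative per-reading embeddings for the surface form "fils".
fils_readings = {
    "fil<n><m><pl>":  [0.323, 0.423, 0.11, 0.595],
    "fils<n><m><sp>": [0.53, 0.605, 0.54, 0.639],
}

# Hypothetical context vector, e.g. an average over target-side context words.
context = [0.30, 0.40, 0.10, 0.60]
ranking = rank_readings(fils_readings, context)
```
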
>
> > In French there are plenty of words with different meanings, depending
> > on the gender: livre, page, tour, etc. The problem is that often the
> > immediate surrounding context does not disambiguate: des livres, les
> > pages, de tour, etc.
>
> This sounds like it would work with some kind of longer distance,
> bag-of-wordsy context module.
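
One way to picture such a bag-of-words module, as a sketch only: each sense gets a set of cue words, and the sense whose cues overlap most with the words of the whole sentence (or a wider window) wins, with a default when nothing matches. The cue lists and the function name are made up for illustration; real cues would have to come from a corpus.

```python
def choose_sense(context_tokens, sense_cues, default):
    """Pick the sense whose cue words overlap most with the context.

    context_tokens: tokens from the sentence or a wider window
    sense_cues:     mapping sense -> set of cue words (hypothetical lists)
    default:        sense returned when no cue word is present
    """
    context = {token.lower() for token in context_tokens}
    best, best_score = default, 0
    for sense, cues in sense_cues.items():
        score = len(context & cues)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hypothetical cue words for two senses of French "livre".
livre_cues = {
    "book":  {"lire", "auteur", "bibliothèque"},
    "pound": {"sterling", "prix", "payer"},
}

sense = choose_sense("il a payé dix livres sterling".split(), livre_cues, "book")
```
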
>
> > A similar but slightly different case is the word
> > pairs homicide mf/homicide m, féminicide mf/féminicide m, parricide
> > mf/parricide m, etc.: the one with the gender "mf" is a person and the
> > other is the action.
>
> This looks like the fil/fils problem.
>
> > Other problems come in lexical selection. For instance, as a rule,
> > Catalan preposition "de" is translated as "de" in French, but if the
> > following word is a material, "en" must be selected (de fusta > en
> > bois). So in the Catalan2French lrx file we have a list of materials,
> > as we have a list of countries, a list of musical instruments, a list
> > of animals, etc. I dream about a monolingual dictionary where we could
> > get this kind of information. It is not useful to have these lists for
> > many language pairs using Catalan. This information should be in
> > apertium-cat and not in every apertium-cat-xxx lrx file.
>
> Yes, this kind of information should certainly go in the monolingual
> packages. A quick and simple approach would be to make a .metadix-style
> format where this information is stored around entries and generated as
> .lrx-style lists by an xslt script.
>
> <e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
> <e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
>
> ->
>
>      <def-seq n="materia">
>         <match lemma="fusta"/>
>         <match lemma="ferro"/>
>      </def-seq>
>
> These could then be included into the .lrx file by the Makefile, or
> a separate, monolingual, file could be another argument to lrx-comp.
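
The generation step could be sketched as follows. This uses Python instead of the xslt script mentioned above, purely for illustration, and assumes the proposed `list` attribute on `<e>` entries, which is not part of the existing dictionary format. The output simply mirrors the `<def-seq>` shape shown in the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from xml.sax.saxutils import quoteattr

def build_def_seqs(dix_xml):
    """Group entry lemmas by their (proposed) list attribute and emit def-seqs."""
    root = ET.fromstring(dix_xml)
    lists = defaultdict(list)
    for entry in root.iter("e"):
        name, lemma = entry.get("list"), entry.get("lm")
        if name and lemma:
            lists[name].append(lemma)
    lines = []
    for name, lemmas in lists.items():
        lines.append(f"<def-seq n={quoteattr(name)}>")
        lines.extend(f"  <match lemma={quoteattr(lemma)}/>" for lemma in lemmas)
        lines.append("</def-seq>")
    return "\n".join(lines)

# Hypothetical metadix-style fragment wrapped in a section element.
SAMPLE = """<section>
<e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
<e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
<e lm="casa"><i>cas</i><par n="abell/a__n"/></e>
</section>"""

generated = build_def_seqs(SAMPLE)
```

Entries without the attribute, like "casa" here, are skipped, so the lists stay opt-in per entry.
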
>
> > Moreover, if we had words not only with different kinds of semantic
> > labels, but also marked as synonyms, maybe it'd be possible to give a
> > translation using a word labeled as synonym (if it has a translation)
> > instead of "unknown".
>
> Not sure about this one, some concrete examples would help.
>

For instance, the bilingual dictionary has "tomàquet" as "tomato". If the
text contains one of the many other Catalan words for "tomato"
("tomàtiga", "domàtiga", "tomata", etc.) that nobody has added to the
bilingual dictionary (as is usually the case), it could still be
translated as "tomato". The same goes for low-frequency synonyms, e.g.
old-fashioned words (Spanish "congratular" for "felicitar", Catalan
"maridar" for "casar", etc.).
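
As a minimal sketch of what I mean (the data structures and function name are hypothetical, not anything that exists in Apertium today): look the lemma up in the bilingual dictionary first, and only if that fails try the lemmas marked as its synonyms in the monolingual data.

```python
def translate(lemma, bidix, synonyms):
    """Translate a lemma, falling back to a synonym that has a translation.

    bidix:    mapping source lemma -> target lemma
    synonyms: mapping source lemma -> list of synonym lemmas (assumed to
              come from monolingual data)
    Returns the lemma prefixed with '*' when nothing is found.
    """
    if lemma in bidix:
        return bidix[lemma]
    for candidate in synonyms.get(lemma, ()):
        if candidate in bidix:
            return bidix[candidate]
    return "*" + lemma

bidix = {"tomàquet": "tomato"}
synonyms = {
    "tomàtiga": ["tomàquet"],
    "domàtiga": ["tomàquet"],
    "tomata":   ["tomàquet"],
}

result = translate("tomàtiga", bidix, synonyms)
```
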




>
> Fran
>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
