Message from Francis Tyers <fty...@prompsit.com> on Thu, 18 June 2020 at 1:59:
> On 2020-06-17 21:46, Hèctor Alòs i Font wrote:
> > Message from Hèctor Alòs i Font <hectora...@gmail.com> on Wed,
> > 17 June 2020 at 23:36:
> >
> >> Message from Francis Tyers <fty...@prompsit.com> on Wed, 17 June
> >> 2020 at 21:12:
> >>
> >>> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> >>>> Here come several practical examples. I tried to select them for
> >>>> their variety. The result is more a wish list than something
> >>>> structured.
> >>>
> >>> These really are great! Thanks :) Sorry the reply has taken so
> >>> long.
> >>>
> >>>> Let's begin with "je la baise". Depending on the context this may be
> >>>> "I kiss her" or "I fuck her". The context can tell us if we are in a
> >>>> formal or colloquial type of language. Another issue is that in this
> >>>> case the anaphora resolution can also help us: if the pronoun
> >>>> reference is "hand", it can only be "kiss"; if it is a person, the
> >>>> doubt persists.
> >>>
> >>> For this, I would like to look at a concordance* of a large number of
> >>> examples to see what kind of information can be used to disambiguate.
> >>>
> >>> Intuitively it seems like knowing the genre (e.g. formal/informal)
> >>> would help. But probably also statistics about subjects, objects and
> >>> adjuncts, and what they (co-)refer with.
> >>>
> >>> * I tried to search on DuckDuckGo, but in the "internet" domain it
> >>> is very hard to find examples with "kiss", even with "moderated
> >>> search" turned on.
> >>>
> >>> In fact, perhaps that could be a genre "safe translation"... :D
> >>>
> >>> Incidentally Google gives "I fuck her" as the translation. I'm able
> >>> to get "kiss" by adding "bouche" or "main".
> >>>
> >>> I think if we want to go by frequency we should have "fuck"; if we
> >>> go by safety we should have "kiss".
> >>>
> >>> Probably "humblement" or "vous" are also good indicators of the
> >>> "kiss" meaning.
> >>>
> >>> Any better than that would require further investigation with a
> >>> concordance.
> >>>
> >>> In terms of the module, if we want to do informal/formal then my
> >>> previous suggestion would work fine.
> >>>
> >>>> Another kind of problem is the Arpitan words "chamô" ("camel";
> >>>> plural "camels") and "chamôs" ("chamois"; unchanged in plural). So,
> >>>> translating into French, yesterday I got chamois in a Bible text of
> >>>> Exodus xD I solved it by deciding in a CG rule that all "chamôs"
> >>>> (with nothing around it in the singular) are camels.
> >>>
> >>> As this is a different morphological paradigm, I would go with the
> >>> superscript notation ¹²³...
> >>>
> >>>> (Similar cases in
> >>>> French: fil/fils, foi/fois, cour/cours)
> >>>
> >>> These have different lemmas, e.g.
> >>>
> >>> ^fils/fil<n><m><pl>/fils<n><m><sp>$     threads / son*
> >>> ^fois/foi<n><f><pl>/fois<n><f><sp>$     faiths / time*
> >>> ^cours/cour<n><f><pl>/cours<n><m><sp>$  courts / course*
> >>>
> >>> The 'cour/cours' example can potentially be disambiguated by the
> >>> gender.
> >>>
> >>> For the others I suppose rules could be written, but I suspect they
> >>> would be quite brittle. My guess is that the <sp> ones are more
> >>> frequent. So those should be default, then the question is finding
> >>> specific contexts where it should be the others. A concordance would
> >>> help, but I'm not sure how they would be split by genre or semantic
> >>> field. This is really a problem with how world-knowledge is encoded.
> >>>
> >>> I wonder if something could be done with word embeddings here. For
> >>> example my guess is that in the target language the two variants
> >>> should not be close in the vector space. And they should be closer
> >>> to words in the same semantic field.
> >>> This could then be something like a reweighting of the translations
> >>> according to target language semantic coherence.
> >>>
> >>> Note that it would require information to be "backpropagated" from
> >>> the target language to the source language. Perhaps you could have
> >>> something like per-reading embeddings that are trained using target
> >>> language information,
> >>>
> >>> so e.g. (fils, fil<n><m><pl>)  [0.323, 0.423, 0.11, 0.595]
> >>>         (fils, fils<n><m><sp>) [0.53, 0.605, 0.54, 0.639]
> >>>
> >>> Felipe did something like this in his thesis, but he only looked at
> >>> sequences of part of speech tags. Here we need to know information
> >>> about the actual analyses.
> >>>
> >>>> In French there are plenty of words with different meanings,
> >>>> depending on the gender: livre, page, tour, etc. The problem is that
> >>>> often the immediate surrounding context does not disambiguate: des
> >>>> livres, les pages, de tour, etc.
> >>>
> >>> This sounds like it would work with some kind of longer distance,
> >>> bag-of-wordsy context module.
> >>>
> >>>> A similar but slightly different case is the word
> >>>> pairs homicide mf/homicide m, féminicide mf/féminicide m, parricide
> >>>> mf/parricide, etc.: the one with the gender "mf" is a person and the
> >>>> other is the action.
> >>>
> >>> This looks like the fil/fils problem.
> >>>
> >>>> Other problems come in lexical selection. For instance, as a rule,
> >>>> the Catalan preposition "de" is translated as "de" in French, but if
> >>>> the following word is a material, "en" must be selected (de fusta >
> >>>> en bois). So in the Catalan2French lrx file we have a list of
> >>>> materials, as we have a list of countries, a list of musical
> >>>> instruments, a list of animals, etc. I dream about a monolingual
> >>>> dictionary where we could get this kind of information.
> >>>> It is not useful to have these lists for many language pairs using
> >>>> Catalan. This information should be in apertium-cat and not in every
> >>>> apertium-cat-xxx lrx file.
> >>>
> >>> Yes, this kind of information should certainly go in the monolingual
> >>> packages. A quick and simple approach would be to make a .metadix
> >>> style format where this information is stored around entries and
> >>> generated in .lrx style lists by an xslt script.
> >>>
> >>> <e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
> >>> <e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
> >>>
> >>> ->
> >>>
> >>> <def-seq n="materia">
> >>>   <match lemma="fusta"/>
> >>>   <match lemma="ferro"/>
> >>> </def-seq>
> >>>
> >>> These could then be included into the .lrx file by the Makefile, or
> >>> a separate, monolingual, file could be another argument to lrx-comp.
> >>>
> >>>> Moreover, if we had words not only with different kinds of semantic
> >>>> labels, but also marked as synonyms, maybe it'd be possible to give
> >>>> a translation using a word labeled as synonym (if it has a
> >>>> translation) instead of "unknown".
> >>>
> >>> Not sure about this one, some concrete examples would help.
> >>
> >> For instance, we have in the bilingual dictionary that "tomàquet"
> >> is "tomato", but if in the text we have one of the many
> >> possibilities for "tomato" in Catalan, and we have not added it
> >> in the bilingual dictionary, "tomàtiga", "domàtiga", "tomata",
> >> etc. could be translated into "tomato" if someone has not added all
> >> of them into the bilingual dictionary (as usually nobody does). The
> >> same goes for low-frequency synonyms, e.g. old-fashioned, etc.
> >> (Spanish "congratular" for "felicitar", Catalan "maridar" for
> >> "casar", etc.).
> >
> > I'd add that one of the problems with that is that these synonyms may
> > be polysemic.
> > For instance "bubota" seems to be quite widely used in
> > Balearic Catalan, but can mean both "scarecrow" and "ghost". Probably
> > just one of the two could be selected as synonym if "bubota" is
> > missing in a bilingual dictionary.
>
> Yep, this is the kind of thing that people are working on at the moment
> with neural machine translation. For example in translating informal
> texts, how do you make sure that you get the translations of "today",
> "2day" "tooday" "tday" etc, such as in:
>
> https://www.clsp.jhu.edu/workshops/19-workshop/improving-translation-of-informal-language/

This kind of problem is typical, and even very frequent, for languages
with no standard or only a weak one. For instance, in Arpitan, alongside
the standard ending "ament" in many nouns and adverbs, I have found
dozens of cases of "ement" and even "èment". Similarly, I have found "è"
instead of "ê", and vice versa, and "a" instead of "â", and vice versa.
It's a big mess when I get "real" texts from the net. But declaring in
the monodix that every "ê" can be "è" and vice versa, and every "â" can
be "a" and vice versa, would create a huge number of homonyms that would
make disambiguation almost impossible (so I won't do it). So this kind of
improvement may help translators from under-resourced languages...
provided that enormous corpora are not required to learn the "rules".

Hèctor

> Fran
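
To put a rough number on the homonym problem described above, here is a minimal Python sketch, not anything that exists in Apertium. It assumes the two equivalences mentioned (ê/è and â/a) apply to every letter in both directions. The names `CLASSES`, `spelling_variants` and `collide` are made up for the illustration, and "bâtêla" is an invented string, not a real Arpitan word.

```python
from itertools import product

# Assumed equivalence classes: any member of a class may be written
# as any other member of the same class (in both directions).
CLASSES = [{"ê", "è"}, {"â", "a"}]
VARIANTS = {ch: sorted(cls) for cls in CLASSES for ch in cls}


def spelling_variants(word):
    """Return every spelling reachable by swapping letters within a class."""
    options = [VARIANTS.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}


def collide(a, b):
    """True if two forms become indistinguishable under the equivalences."""
    return bool(spelling_variants(a) & spelling_variants(b))


if __name__ == "__main__":
    for w in ["chamô", "bâtêla"]:
        print(w, len(spelling_variants(w)))
```

Each affected letter doubles the count, so "chamô" (one "a") gets 2 spellings and the invented "bâtêla" (three affected letters) gets 8. Any two entries for which `collide` returns True would become homonyms in the analyser, which is the blow-up that makes the blanket rule impractical.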
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff