Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

Hèctor Alòs i Font Thu, 18 Jun 2020 00:01:04 -0700

Francis Tyers <fty...@prompsit.com> wrote on Thu, 18 Jun 2020 at
1:59:


> On 2020-06-17 21:46, Hèctor Alòs i Font wrote:
> > Hèctor Alòs i Font <hectora...@gmail.com> wrote on Wed, 17 Jun 2020
> > at 23:36:
> >
> >> Francis Tyers <fty...@prompsit.com> wrote on Wed, 17 Jun 2020 at
> >> 21:12:
> >>
> >>> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> >>>> Here come several practical examples. I tried to select them for
> >>>> their variety. The result is more a wish list than something
> >>>> structured.
> >>>
> >>> These really are great! Thanks :) Sorry the reply has taken so long.
> >>>
> >>>> Let's begin with "je la baise". Depending on the context this may
> >>>> be "I kiss her" or "I fuck her". The context can tell us whether
> >>>> we are in a formal or colloquial register. Another point is that
> >>>> in this case anaphora resolution can also help: if the pronoun
> >>>> refers to "hand", it can only be "kiss"; if it refers to a person,
> >>>> the doubt persists.
> >>>
> >>> For this, I would like to look at a concordance* of a large number
> >>> of examples to see what kind of information can be used to
> >>> disambiguate.
> >>>
> >>> Intuitively it seems like knowing the genre (e.g. formal/informal)
> >>> would help. But probably also statistics about subjects, objects
> >>> and adjuncts, and what they (co-)refer to.
> >>>
> >>> * I tried to search on DuckDuckGo, but in the "internet" domain it
> >>> is very hard to find examples with "kiss", even with "moderated
> >>> search"
> >>> turned on.
> >>>
> >>> In fact, perhaps that could be a genre "safe translation"... :D
> >>>
> >>> Incidentally, Google gives "I fuck her" as the translation. I'm
> >>> able to get "kiss" by adding "bouche" or "main".
> >>>
> >>> I think if we want to go by frequency we should have "fuck"; if we
> >>> go by safety we should have "kiss".
> >>>
> >>> Probably "humblement" or "vous" are also good indicators of the
> >>> "kiss" meaning.
> >>>
> >>> Anything better than that would require further investigation with
> >>> a concordance.
> >>>
> >>> In terms of the module, if we want to do informal/formal then my
> >>> previous suggestion would work fine.
> >>>
> >>>> Another kind of problem is the Arpitan words "chamô" ("camel";
> >>>> plural "camels") and "chamôs" ("chamois"; unchanged in plural).
> >>>> So, translating into French, yesterday I got chamois in a Bible
> >>>> text from Exodus xD  I solved it by deciding in a CG rule that all
> >>>> "chamôs" (with nothing around them in the singular) are camels.
> >>>
> >>> As this is a different morphological paradigm, I would go with the
> >>> superscript notation ¹²³...
> >>>
> >>>> (Similar cases in
> >>>> French: fil/fils, foi/fois, cour/cours)
> >>>
> >>> These have different lemmas, e.g.
> >>>
> >>> ^fils/fil<n><m><pl>/fils<n><m><sp>$     threads / son*
> >>> ^fois/foi<n><f><pl>/fois<n><f><sp>$     faiths / time*
> >>> ^cours/cour<n><f><pl>/cours<n><m><sp>$  courts / course*
> >>>
> >>> The 'cour/cours' example can potentially be disambiguated by the
> >>> gender.
> >>>
> >>> For the others I suppose rules could be written, but I suspect
> >>> they would be quite brittle. My guess is that the <sp> ones are
> >>> more frequent, so those should be the default; then the question
> >>> is finding specific contexts where it should be the others. A
> >>> concordance would help, but I'm not sure how they would be split
> >>> by genre or semantic field. This is really a problem with how
> >>> world knowledge is encoded.
> >>>
> >>> I wonder if something could be done with word embeddings here. For
> >>> example, my guess is that in the target language the two variants
> >>> should not be close in the vector space, and they should be closer
> >>> to words in the same semantic field. This could then be something
> >>> like a reweighting of the translations according to target
> >>> language semantic coherence.
> >>>
> >>> Note that it would require information to be "backpropagated" from
> >>> the target language to the source language. Perhaps you could have
> >>> something like per-reading embeddings that are trained using
> >>> target language information,
> >>>
> >>> so e.g. (fils, fil<n><m><pl>) [0.323, 0.423, 0.11, 0.595]
> >>> (fils, fils<n><m><sp>) [0.53, 0.605, 0.54, 0.639]
> >>>
> >>> Felipe did something like this in his thesis, but he only looked
> >>> at sequences of part-of-speech tags. Here we need to know
> >>> information about the actual analyses.
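
A minimal sketch of the reweighting idea, assuming per-reading
embeddings already exist. The two reading vectors are the illustrative
ones quoted above; the context vectors, the function names and the use
of cosine similarity against the mean context vector are assumptions
made for this example, not anything Apertium currently provides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pick_reading(readings, context_vectors):
    """Return the reading whose embedding is closest to the context centre."""
    centre = mean_vector(context_vectors)
    return max(readings, key=lambda r: cosine(readings[r], centre))

# Per-reading embeddings for the surface form "fils" (toy values from the thread).
READINGS = {
    "fil<n><m><pl>": [0.323, 0.423, 0.11, 0.595],
    "fils<n><m><sp>": [0.53, 0.605, 0.54, 0.639],
}

# Hypothetical context embeddings, e.g. for words of the surrounding sentence.
thread_like_context = [[0.3, 0.4, 0.1, 0.6]]
family_like_context = [[0.5, 0.6, 0.5, 0.6]]

print(pick_reading(READINGS, thread_like_context))
print(pick_reading(READINGS, family_like_context))
```

How such vectors would be trained from target-language information is
the open question here; the sketch only covers the selection step.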
> >>>
> >>>> In French there are plenty of words with different meanings
> >>>> depending on the gender: livre, page, tour, etc. The problem is
> >>>> that often the immediate surrounding context does not
> >>>> disambiguate: des livres, les pages, de tour, etc.
> >>>
> >>> This sounds like it would work with some kind of longer distance,
> >>> bag-of-wordsy context module.
> >>>
> >>>> A similar but slightly different case is the word pairs homicide
> >>>> mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide
> >>>> m, etc.: the one with gender "mf" is a person and the other is
> >>>> the action.
> >>>
> >>> This looks like the fil/fils problem.
> >>>
> >>>> Other problems come in lexical selection. For instance, as a
> >>>> rule, the Catalan preposition "de" is translated as "de" in
> >>>> French, but if the following word is a material, "en" must be
> >>>> selected (de fusta > en bois). So in the Catalan2French lrx file
> >>>> we have a list of materials, as we have a list of countries, a
> >>>> list of musical instruments, a list of animals, etc. I dream
> >>>> about a monolingual dictionary where we could get this kind of
> >>>> information. It makes no sense to maintain these lists separately
> >>>> for the many language pairs that use Catalan. This information
> >>>> should be in apertium-cat and not in every apertium-cat-xxx lrx
> >>>> file.
> >>>
> >>> Yes, this kind of information should certainly go in the
> >>> monolingual packages. A quick and simple approach would be to make
> >>> a .metadix-style format where this information is stored around
> >>> entries and generated in .lrx-style lists by an XSLT script.
> >>>
> >>> <e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
> >>> <e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
> >>>
> >>> ->
> >>>
> >>> <def-seq n="materia">
> >>> <match lemma="fusta"/>
> >>> <match lemma="ferro"/>
> >>> </def-seq>
> >>>
> >>> These could then be included into the .lrx file by the Makefile,
> >>> or a separate, monolingual, file could be another argument to
> >>> lrx-comp.
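
The generation step could be sketched roughly as follows, in Python
rather than XSLT. The `list` attribute on `<e>` is the hypothetical
extension proposed above, not part of the current .dix format, and the
output simply mirrors the `<def-seq>` shape shown above; it has not been
checked against what lrx-comp actually accepts. The function name and
the sample input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def def_seqs_from_dix(dix_xml: str) -> str:
    """Group entry lemmas by their (hypothetical) list attribute."""
    groups = {}
    for entry in ET.fromstring(dix_xml).iter("e"):
        list_name = entry.get("list")
        lemma = entry.get("lm")
        if list_name and lemma:
            groups.setdefault(list_name, []).append(lemma)

    lines = []
    for list_name in sorted(groups):
        lines.append(f"<def-seq n={quoteattr(list_name)}>")
        for lemma in groups[list_name]:
            lines.append(f"  <match lemma={quoteattr(lemma)}/>")
        lines.append("</def-seq>")
    return "\n".join(lines)

# Two entries from the example above plus one made-up entry without a list.
SAMPLE = """<section>
<e lm="fusta" list="materia"><i>fust</i><par n="abell/a__n"/></e>
<e lm="ferro" list="materia"><i>ferro</i><par n="abric__n"/></e>
<e lm="xyz"><i>xyz</i></e>
</section>"""

print(def_seqs_from_dix(SAMPLE))
```

Entries without a `list` attribute are skipped, so the monolingual
dictionary would only need annotating where a list is wanted.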
> >>>
> >>>> Moreover, if we had words not only with different kinds of
> >>>> semantic labels, but also marked as synonyms, maybe it'd be
> >>>> possible to give a translation using a word labelled as a synonym
> >>>> (if it has a translation) instead of "unknown".
> >>>
> >>> Not sure about this one, some concrete examples would help.
> >>
> >> For instance, the bilingual dictionary has "tomàquet" as "tomato".
> >> If the text contains one of the many other Catalan words for
> >> "tomato" ("tomàtiga", "domàtiga", "tomata", etc.) and nobody has
> >> added it to the bilingual dictionary (as usually nobody does), it
> >> could still be translated as "tomato" through the synonym link. The
> >> same goes for low-frequency synonyms, e.g. old-fashioned words
> >> (Spanish "congratular" for "felicitar", Catalan "maridar" for
> >> "casar", etc.).
> >
> > I'd add that one of the problems here is that these synonyms may be
> > polysemous. For instance "bubota" seems to be quite widely used in
> > Balearic Catalan, but can mean both "scarecrow" and "ghost". Probably
> > just one of the two could be selected as the synonym if "bubota" is
> > missing in a bilingual dictionary.
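
A toy sketch of the fallback described above, assuming the monolingual
package could supply a synonym table. Both dictionaries here are
hand-written stand-ins (the English gloss for "casar" is added for
illustration), and `translate` is a hypothetical helper, not an
existing Apertium function.

```python
from typing import Optional

# Toy bilingual dictionary (Catalan -> English); real data lives in the bidix.
BIDIX = {"tomàquet": "tomato", "casar": "marry"}

# Hypothetical synonym table that a monolingual package might provide.
# Each word lists its preferred synonyms in order of priority.
SYNONYMS = {
    "tomàtiga": ["tomàquet"],
    "domàtiga": ["tomàquet"],
    "tomata": ["tomàquet"],
    "maridar": ["casar"],
}

def translate(lemma: str) -> Optional[str]:
    """Look the lemma up directly, then through its synonyms, else give up."""
    if lemma in BIDIX:
        return BIDIX[lemma]
    for synonym in SYNONYMS.get(lemma, []):
        if synonym in BIDIX:
            return BIDIX[synonym]
    return None  # still unknown

print(translate("tomata"))
print(translate("maridar"))
print(translate("bubota"))
```

For a polysemous word such as "bubota", the table would have to commit
to one preferred synonym, which is exactly the limitation noted above.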
>
> Yep, this is the kind of thing that people are working on at the moment
> with neural machine translation. For example in translating informal
> texts, how do you make sure that you get the translations of "today",
> "2day" "tooday" "tday" etc, such as in:
>
>
> https://www.clsp.jhu.edu/workshops/19-workshop/improving-translation-of-informal-language/


These kinds of problems are typical, and indeed very frequent, for
languages with no standard or only a weak one. For instance, currently
in Arpitan, alongside the standard ending "ament" in many nouns and
adverbs, I have found dozens of "ement" and even "èment". Similarly,
instead of "ê" I find "è", or the opposite, and instead of "â" I find
"a", or the opposite. It's a big mess when I get "real" texts from the
net. But declaring in the monodix that every "ê" can be "è" and vice
versa, and every "â" can be "a" and vice versa, would create a huge
number of homonyms that would make disambiguation almost impossible (so
I won't do it).
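
To give a rough idea of the blow-up, here is a small sketch that
enumerates every spelling obtained by freely swapping "ê"/"è" and
"â"/"a". The function name is made up, the input is assumed to use
precomposed (NFC) characters, and the three-letter test string is not a
real word; it only shows that k swappable letters yield 2^k candidate
forms.

```python
from itertools import product

# Letters treated as interchangeable, as in the two swaps discussed above.
CLASSES = [("ê", "è"), ("â", "a")]

def spelling_variants(word: str) -> set:
    """Return every spelling reachable by swapping letters within a class."""
    options = []
    for ch in word:
        for cls in CLASSES:
            if ch in cls:
                options.append(cls)  # either member may appear here
                break
        else:
            options.append((ch,))  # letter is not affected by any swap
    return {"".join(combo) for combo in product(*options)}

print(sorted(spelling_variants("ament")))
print(len(spelling_variants("âêa")))  # three swappable letters
```

Every one of those forms would then be analysed as a potential match for
some other entry, which is where the extra homonyms come from.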

So this kind of improvement may help translators from under-resourced
languages... provided enormous corpora are not required to learn the
"rules".

Hèctor



>
>
> Fran
>

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
