Re: [Apertium-stuff] Adding invariant prefixes
Message from Jaume Ortolà i Font on Wed, 18 Sep 2019 at 11:05:

> Thanks for the answers.
>
> Message from Jonathan Washington on Tue, 17 Sep 2019 at 22:11:
>
>> Jaume, are you planning on using this for translation or something else?
>> If for translation, how do you anticipate it improving translation quality?
>
> These prefixes will be used for translating spa-cat, and they could also be
> used for other Romance language pairs. Hèctor Alòs is interested in it.
>
> I have tried the first option proposed by Kevin with just adjectives and
> some prefixes in Spanish:
>
> anti
> pro
> post
> pospost
> pre
>
> antiranti
> prorpro
> postpost
> prerpre
> antianti
> propro
> pospost
> prepre
>
> In the Europarl corpus it finds around one new word (untranslated so far)
> every 5000 sentences. A few more prefixes can be added, and the same would
> be done with nouns and verbs.
>
> We'll need to create metadix files so that the dictionaries don't become
> cluttered with the new tags. The metadix will also be useful for other
> things.
>
> Some new words formed with prefixes can match existing words. All these
> should be discarded beforehand:
> prefiero (verb) = pre + fiero (adj)
> presumo (verb) = pre + sumo (adj)
> prerrogativa (noun) = pre + (r)rogativa (adj)
>
> I have tried adding a mark to the newly formed words and removing it with
> CG if necessary. It works fine.
>
> pre-prefix-pre
>
> REMOVE:prefixes ("-prefix-.*"r) IF (0 ("-prefix-.*"r));
>
> I think adding this feature is productive and worthwhile. What do you
> think (Hèctor, Marc, Xavi...)?
> Any suggestions to improve it?

It seems to me an ingenious way of guessing a word when it is missing from the dictionaries.

The system you propose seems robust and, if it is used for a few prefixes that typically have equivalents in the nearby/target languages, I do not see, a priori, much of a problem, especially for "long" prefixes like "anti" or "post" ("re" would be more problematic). Also, "LR" and "RL" can be used if, for example, "post" is not problematic in Catalan but "pos" is found to be so in Spanish.

The system obviously overgenerates many words in the monolingual dictionaries, but if someone does not want to use it for a particular language pair, it is enough not to put the paradigm in the bilingual dictionary, or to put it in for fewer prefixes. It has to be well tested, of course. In any case, it would probably be safer to differentiate between prefixes for adjectives, nouns and verbs, to minimize unwanted overgeneration.

Hèctor

___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
Re: [Apertium-stuff] Adding invariant prefixes
Message from Kevin Brubeck Unhammer on Wed, 18 Sep 2019 at 12:02:

> > I have tried adding a mark to the newly formed words and removing it with
> > CG if necessary. It works fine.
>
> Why not keep it all the way through the translator? That seems safer to
> me, and you don't have to worry that they may not be synonymous.

Some of these words can be very difficult to disambiguate (and to foresee): prerrogativa (noun) vs. pre + (r)rogativa (adj). I found them because they caused translation errors. So it is safer to remove the analysis with the prefix and keep the analysis with the original POS tags from the dictionary.
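The preference described in this message (drop the prefix analysis whenever the dictionary already has an analysis of its own) can be sketched outside CG roughly as follows. This is only an illustration under assumptions: the "-prefix-" mark mirrors the one mentioned in the thread, but the analysis strings and the function name are hypothetical and are not real Apertium output.

```python
PREFIX_MARK = "-prefix-"  # mark assumed to be present on prefix-guessed analyses


def prefer_dictionary_analyses(analyses):
    """Drop prefix-guessed analyses when a plain dictionary analysis exists.

    If every analysis carries the mark, all of them are kept, so a word
    that is only known through the prefix guess still gets an analysis.
    """
    plain = [a for a in analyses if PREFIX_MARK not in a]
    return plain if plain else list(analyses)


# Hypothetical analysis strings, for illustration only.
ambiguous = ["prerrogativa (noun)", "pre-prefix- rogativa (adj)"]
guessed_only = ["anti-prefix- alemán (adj)"]

print(prefer_dictionary_analyses(ambiguous))
print(prefer_dictionary_analyses(guessed_only))
```

With this policy the guessed reading only survives when nothing else is available, which is the "safer" behaviour argued for above.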
Re: [Apertium-stuff] Adding invariant prefixes
Jaume Ortolà i Font wrote:

> I have tried adding a mark to the newly formed words and removing it with
> CG if necessary. It works fine.

Why not keep it all the way through the translator? That seems safer to me, and you don't have to worry that they may not be synonymous.
Re: [Apertium-stuff] Adding invariant prefixes
Thanks for the answers.

Message from Jonathan Washington on Tue, 17 Sep 2019 at 22:11:

> Jaume, are you planning on using this for translation or something else?
> If for translation, how do you anticipate it improving translation quality?

These prefixes will be used for translating spa-cat, and they could also be used for other Romance language pairs. Hèctor Alòs is interested in it.

I have tried the first option proposed by Kevin with just adjectives and some prefixes in Spanish:

anti
pro
post
pospost
pre

antiranti
prorpro
postpost
prerpre
antianti
propro
pospost
prepre

In the Europarl corpus it finds around one new word (untranslated so far) every 5000 sentences. A few more prefixes can be added, and the same would be done with nouns and verbs.

We'll need to create metadix files so that the dictionaries don't become cluttered with the new tags. The metadix will also be useful for other things.

Some new words formed with prefixes can match existing words. All these should be discarded beforehand:

prefiero (verb) = pre + fiero (adj)
presumo (verb) = pre + sumo (adj)
prerrogativa (noun) = pre + (r)rogativa (adj)

I have tried adding a mark to the newly formed words and removing it with CG if necessary. It works fine.

pre-prefix-pre

REMOVE:prefixes ("-prefix-.*"r) IF (0 ("-prefix-.*"r));

I think adding this feature is productive and worthwhile. What do you think (Hèctor, Marc, Xavi...)? Any suggestions to improve it?

Jaume
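The collisions listed in this message (a generated "prefix + adjective" form that is already an existing word) could be found ahead of time with a small script along these lines. It is a sketch, not part of any Apertium tool: the function names are made up, the word lists are just the examples from the message, and the spelling rule (double an initial "r" after a prefix ending in a vowel) is an assumption generalised from "anti + ruso" becoming "antirruso".

```python
PREFIXES = ["anti", "pro", "post", "pos", "pre"]


def prefixed_form(prefix, base):
    """Glue prefix and base, doubling an initial "r" after a vowel-final prefix."""
    if base.startswith("r") and prefix[-1] in "aeiou":
        return prefix + "r" + base
    return prefix + base


def find_collisions(adjectives, known_forms, prefixes=PREFIXES):
    """Map each generated form that already exists to its (prefix, base) pairs."""
    collisions = {}
    for prefix in prefixes:
        for adj in adjectives:
            form = prefixed_form(prefix, adj)
            if form in known_forms:
                collisions.setdefault(form, []).append((prefix, adj))
    return collisions


# Example data taken from the message; a real run would read the dictionaries.
adjectives = ["fiero", "sumo", "rogativa", "alemán"]
known_forms = {"prefiero", "presumo", "prerrogativa"}

for form, sources in find_collisions(adjectives, known_forms).items():
    print(form, sources)
```

The forms reported this way are the candidates to discard (or to leave to the disambiguator) before the prefix paradigm is enabled.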
Re: [Apertium-stuff] Adding invariant prefixes
On Tue, Sep 17, 2019, 14:58 Kevin Brubeck Unhammer wrote:

> The upside is that you can combine words without listing everything
> twice. If you've only got one prefix, the HFST-like method is probably
> better. If you're combining lots, compounding may be worth considering.

We can and do implement compounding of that sort in HFST transducers too :) Normally we don't bother separating derivational morphemes (in Turkic transducers), though, unless they're extremely productive.

Jaume, are you planning on using this for translation or something else? If for translation, how do you anticipate it improving translation quality?

--
Jonathan
Re: [Apertium-stuff] Adding invariant prefixes
Jaume Ortolà i Font wrote:

> Hi,
>
> I would like to be able to translate automatically certain words formed by
> "a certain prefix + a certain POS" without having to add new entries to the
> dictionaries. For example, any word formed by "anti" + any valid adjective
> in translations spa<>cat:
>
> antihúngaro <> antihongarès
> antihúngaras <> antihongareses
> antialemán <> antialemany
> antipluvial <> antipluvial
> antiestatista <> antiestatista
> ...
>
> The word forms and the POS tags would remain unchanged. (But in some
> languages some spelling changes may be necessary. In Spanish: "anti + ruso"
> becomes "antirruso".)
>
> This feature could be used in a lot of language pairs. Has it been
> implemented anywhere? How could it be done?

You could have a [...] prepended to every [...]:

anti alemán

That would be similar to what people do with HFST.

In nno-nob I use the compounding feature of lttoolbox instead. The relevant parts of the pardefs:

anti alemán

Then "anti" alone doesn't get an analysis (compound-only-L can only give an analysis in compounds), but it can be analysed as a prefix if you use lt-proc with the -e argument:

^anti+alemán$

Pretransfer turns this into two lu's:

^anti$ ^alemán$

The tags compound-only-L and compound-R are "special": a compound analysis can be made of one or more L's followed by an R. The tags are hidden from the output when you use lt-proc -e.

The downside to this method is that every right-hand side needs the compound-R tag on it, so if you had [...], that needs to be [...], etc. You will also need transfer rules to remove the space added by pretransfer, and chunk it etc.

The upside is that you can combine words without listing everything twice. If you've only got one prefix, the HFST-like method is probably better. If you're combining lots, compounding may be worth considering.
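To make the pretransfer step above concrete, here is a rough sketch of the splitting it is described as doing. This is not the code of the real pretransfer program: the function name is hypothetical, every "+" is treated as a join point, and the unit is assumed to hold a single analysis written between "^" and "$" as in the examples above.

```python
def split_compound(lexical_unit):
    """Split "^a+b$" into ["^a$", "^b$"].

    Simplification: every "+" is a join point and the unit contains
    exactly one analysis; anything else is outside this sketch.
    """
    if not (lexical_unit.startswith("^") and lexical_unit.endswith("$")):
        raise ValueError(f"not a lexical unit: {lexical_unit!r}")
    parts = lexical_unit[1:-1].split("+")
    return [f"^{part}$" for part in parts]


# The units are joined with a space, which is why later transfer rules
# have to remove that space again when generating the target word.
print(" ".join(split_compound("^anti+alemán$")))  # ^anti$ ^alemán$
```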
[Apertium-stuff] Adding invariant prefixes
Hi,

I would like to be able to translate automatically certain words formed by "a certain prefix + a certain POS" without having to add new entries to the dictionaries. For example, any word formed by "anti" + any valid adjective in translations spa<>cat:

antihúngaro <> antihongarès
antihúngaras <> antihongareses
antialemán <> antialemany
antipluvial <> antipluvial
antiestatista <> antiestatista
...

The word forms and the POS tags would remain unchanged. (But in some languages some spelling changes may be necessary. In Spanish: "anti + ruso" becomes "antirruso".)

This feature could be used in a lot of language pairs. Has it been implemented anywhere? How could it be done?

Jaume Ortolà
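As a rough picture of the behaviour being asked for, the toy lookup below translates "anti" + a known adjective by translating the adjective and copying the prefix unchanged. It is only a sketch: the dictionary holds a few lemma pairs from the examples in this message, the names are hypothetical, and inflected forms and spelling changes such as "antirruso" are deliberately left out.

```python
# Toy spa -> cat adjective dictionary built from the examples in the message.
BIDIX = {"húngaro": "hongarès", "alemán": "alemany", "pluvial": "pluvial"}
PREFIXES = ["anti"]  # invariant prefixes: copied to the target unchanged


def translate_prefixed(word, bidix=BIDIX, prefixes=PREFIXES):
    """Translate word directly, or as prefix + known adjective; None if unknown."""
    if word in bidix:
        return bidix[word]
    for prefix in prefixes:
        if not word.startswith(prefix):
            continue
        base = word[len(prefix):]
        if base in bidix:
            return prefix + bidix[base]
    return None


for source in ["antihúngaro", "antialemán", "antipluvial", "alemán"]:
    print(source, "<>", translate_prefixed(source))
```

A direct dictionary entry is tried first, so an existing word is never re-analysed as prefix + adjective.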