Re: [Apertium-stuff] Adding invariant prefixes

2019-09-18 Thread Hèctor Alòs i Font
Jaume Ortolà i Font wrote on Wed, 18 Sep 2019 at 11:05:

> I think adding this feature is productive and worthwhile. What do you
> think (Hèctor, Marc, Xavi...)?
> Any suggestion to improve it?
>

It seems to me an ingenious way of guessing a word when it is missing from
the dictionaries. The system you propose seems robust and, if it is used
for a few prefixes that typically have equivalents in the nearby/target
languages, I do not see, a priori, much of a problem, especially for "long"
prefixes like "anti" or "post" ("re" would be more problematic). Also, "LR"
and "RL" can be used if, for example, "post" is not problematic in Catalan
but "pos" is found to be so in Spanish. The system obviously overgenerates
many words in the monolingual dictionaries, but if someone does not want to
use it for a particular language pair, it is enough not to put the paradigm
in the bilingual dictionary, or to put it in for fewer prefixes. It has to
be well tested, of course.
Anyway, it would probably be safer to differentiate between prefixes for
adjectives, nouns and verbs, to minimize unwanted overgeneration.

Hèctor

___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Adding invariant prefixes

2019-09-18 Thread Jaume Ortolà i Font
Kevin Brubeck Unhammer wrote on Wed, 18 Sep 2019 at 12:02:

> > I have tried adding a mark to the newly formed words and removing it with
> > CG if necessary. It works fine.
>
> Why not keep it all the way through the translator? That seems safer to
> me, and you don't have to worry that they may not be synonymous.
>

Some of these words can be very difficult to disambiguate (and to foresee):
prerrogativa (noun) vs. pre + (r)rogativa (adj). I found them because they
caused translation errors. So it is safer to remove the analysis with
prefix, and keep the original POS tags in the dictionary.


Re: [Apertium-stuff] Adding invariant prefixes

2019-09-18 Thread Kevin Brubeck Unhammer
Jaume Ortolà i Font wrote:

> I have tried adding a mark to the newly formed words and removing it with
> CG if necessary. It works fine.

Why not keep it all the way through the translator? That seems safer to
me, and you don't have to worry that they may not be synonymous.





Re: [Apertium-stuff] Adding invariant prefixes

2019-09-18 Thread Jaume Ortolà i Font
Thanks for the answers.

Jonathan Washington wrote on Tue, 17 Sep 2019 at 22:11:

> Jaume, are you planning on using this for translation or something else?
> If for translation, how do you anticipate it improving translation quality?
>

These prefixes will be used for translating spa-cat, and they could also be
used for other Romance language pairs. Hèctor Alòs is interested in it.

I have tried the first option proposed by Kevin with just adjectives and
some prefixes in Spanish:


  anti
  pro
  post
  pos -> post
  pre

  antir -> anti
  pror -> pro
  post -> post
  prer -> pre
  anti -> anti
  pro -> pro
  pos -> post
  pre -> pre


In the Europarl corpus it finds around one new word (untranslated so far)
every 5000 sentences. A few more prefixes can be added, and the same would
be done with nouns and verbs.

We'll need to create metadix files so that the dictionaries don't become
cluttered with the new tags. The metadix will also be useful for other
things.

Some new words formed with prefixes can match existing words. All these
should be discarded beforehand.
prefiero (verb) = pre + fiero (adj)
presumo (verb) = pre + sumo (adj)
prerrogativa (noun) = pre + (r)rogativa (adj)
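
As a rough illustration of how such clashes could be listed beforehand, here is a minimal Python sketch. It assumes the prefixes, the adjectives and the existing lexicon are available as plain Python collections; the names generated_form and find_collisions and the toy data are hypothetical and are not part of any Apertium tool.

```python
def generated_form(prefix, adjective):
    """Surface form of prefix + adjective.

    Assumption: an initial r is doubled after a vowel-final prefix,
    as in pre + rogativa -> prerrogativa.
    """
    if adjective.startswith("r") and prefix[-1] in "aeiou":
        return prefix + "r" + adjective
    return prefix + adjective


def find_collisions(prefixes, adjectives, lexicon):
    """Map each generated form that is already a known word to
    (prefix, adjective, part of speech of the existing word)."""
    collisions = {}
    for prefix in prefixes:
        for adjective in adjectives:
            form = generated_form(prefix, adjective)
            if form in lexicon:
                collisions[form] = (prefix, adjective, lexicon[form])
    return collisions


# Toy data taken from the examples in this message.
prefixes = ["anti", "pre"]
adjectives = ["fiero", "sumo", "rogativa", "alemán"]
lexicon = {"prefiero": "verb", "presumo": "verb", "prerrogativa": "noun"}

for form, info in find_collisions(prefixes, adjectives, lexicon).items():
    print(form, info)
```

The resulting list could then be reviewed by hand before deciding which prefixed analyses to block.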

I have tried adding a mark to the newly formed words and removing it with
CG if necessary. It works fine.

pre -> -prefix-pre

REMOVE:prefixes ("-prefix-.*"r) IF (0 ("-prefix-.*"r));
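
To make the intended effect of that rule concrete, here is a small Python simulation. It is only a sketch of the behaviour as I understand it: readings whose lemma carries the "-prefix-" mark are dropped whenever an unmarked reading exists, and a cohort is left untouched if every reading is marked (this assumes that REMOVE does not delete the last readings of a cohort by default). The data layout and the function name remove_prefix_readings are hypothetical.

```python
import re

# Same pattern as the lemma regex in the rule above.
PREFIX_MARK = re.compile(r"-prefix-.*")


def remove_prefix_readings(readings):
    """readings is a list of (lemma, tags) pairs for one word form.

    Drop the readings whose lemma matches the prefix mark, unless that
    would leave the word form without any reading at all.
    """
    marked = [r for r in readings if PREFIX_MARK.fullmatch(r[0])]
    if len(marked) == len(readings):
        # Only prefixed analyses exist: keep them, the guess is all we have.
        return list(readings)
    return [r for r in readings if not PREFIX_MARK.fullmatch(r[0])]


# "prefiero" is an existing verb form, so the prefixed guess is discarded.
prefiero = [
    ("preferir", ["vblex", "pri", "p1", "sg"]),
    ("-prefix-prefiero", ["adj", "m", "sg"]),
]
print(remove_prefix_readings(prefiero))

# A form that is only analysable through the prefix keeps its marked reading.
unknown = [("-prefix-prealemán", ["adj", "m", "sg"])]
print(remove_prefix_readings(unknown))
```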

I think adding this feature is productive and worthwhile. What do you think
(Hèctor, Marc, Xavi...)?
Any suggestion to improve it?

Jaume


Re: [Apertium-stuff] Adding invariant prefixes

2019-09-17 Thread Jonathan Washington
On Tue, Sep 17, 2019, 14:58 Kevin Brubeck Unhammer 
wrote:

>
> The upside is that you can combine words without listing everything
> twice. If you've only got one prefix, the HFST-like method is probably
> better. If you're combining lots, compounding may be worth considering.
>

We can and do implement compounding of that sort in HFST transducers too :)

Normally we don't bother separating derivational morphemes (in Turkic
transducers), though, unless they're extremely productive.

Jaume, are you planning on using this for translation or something else?
If for translation, how do you anticipate it improving translation quality?

--
Jonathan



Re: [Apertium-stuff] Adding invariant prefixes

2019-09-17 Thread Kevin Brubeck Unhammer
Jaume Ortolà i Font wrote:

> Hi,
>
> I would like to be able to translate automatically certain words formed by
> "a certain prefix + a certain POS" without having to add new entries to the
> dictionaries. For example, any word formed by "anti" + any valid adjective
> in translations spa<>cat:
>
> antihúngaro <> antihongarès
> antihúngaras <> antihongareses
> antialemán <> antialemany
> antipluvial <> antipluvial
> antiestatista <> antiestatista
> ...
>
> The word forms and the POS tags would remain unchanged. (But in some
> languages some spelling changes may be necessary. In Spanish, "anti + ruso"
> becomes "antirruso".)
>
> This feature could be used in a lot of language pairs. Has it been
> implemented anywhere? How could it be done?

You could have a prefix paradigm prepended to every adjective entry:


  anti
  

alemán

That would be similar to what people do with HFST.

-

In nno-nob I use the compounding feature of lttoolbox instead. The
relevant parts of the pardefs:



  
  


   



 
  


 


  anti
alemán


Then "anti" alone doesn't get an analysis (compound-only-L can only give
an analysis in compounds), but it can be analysed as a
prefix, if you use lt-proc with the -e argument:
^anti+alemán$

Pretransfer turns this into two lu's

^anti$ ^alemán$

The tags <compound-only-L> and <compound-R> are "special" – a compound
analysis can be made of one or more L's followed by an R. The tags are
hidden from the output when you use lt-proc -e.
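
The two steps described above (a compound is one or more L parts followed by an R part, and pretransfer splits the joined analysis into separate lexical units) can be sketched in Python as follows. This is a simplified illustration, not the lttoolbox or apertium-pretransfer implementation: tags are ignored, and the function names is_valid_compound and pretransfer_split are hypothetical.

```python
import re


def is_valid_compound(parts):
    """parts is a list of "L" / "R" labels, one per compound member.

    A compound analysis is accepted only as one or more L's followed by
    exactly one R.
    """
    return re.fullmatch(r"L+R", "".join(parts)) is not None


def pretransfer_split(lexical_unit):
    """Split a joined analysis such as ^anti+alemán$ into one lexical
    unit per member, separated by a space."""
    if not (lexical_unit.startswith("^") and lexical_unit.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^...$")
    pieces = lexical_unit[1:-1].split("+")
    return " ".join("^" + piece + "$" for piece in pieces)


print(is_valid_compound(["L", "R"]))       # prefix + adjective
print(is_valid_compound(["L"]))            # a prefix on its own
print(pretransfer_split("^anti+alemán$"))
```

The space introduced by the split is the one that the transfer rules later have to remove.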


The downside to this method is that every right-hand-side needs the tag
<compound-R> on it, so if you had


 


that needs to be


 


etc.

You will also need transfer rules to remove the space added by
pretransfer, and chunk it etc.

The upside is that you can combine words without listing everything
twice. If you've only got one prefix, the HFST-like method is probably
better. If you're combining lots, compounding may be worth considering.




[Apertium-stuff] Adding invariant prefixes

2019-09-17 Thread Jaume Ortolà i Font
Hi,

I would like to be able to translate automatically certain words formed by
"a certain prefix + a certain POS" without having to add new entries to the
dictionaries. For example, any word formed by "anti" + any valid adjective
in translations spa<>cat:

antihúngaro <> antihongarès
antihúngaras <> antihongareses
antialemán <> antialemany
antipluvial <> antipluvial
antiestatista <> antiestatista
...

The word forms and the POS tags would remain unchanged. (But in some
languages some spelling changes may be necessary. In Spanish, "anti + ruso"
becomes "antirruso".)

This feature could be used in a lot of language pairs. Has it been
implemented anywhere? How could it be done?
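
To make the request concrete, here is a minimal Python sketch of the behaviour I have in mind. It is only an illustration with toy data: the dictionaries, the prefix table and the function names split_prefixed and translate_prefixed are hypothetical, and target-side spelling changes are not handled.

```python
# Toy data for illustration; hypothetical stand-ins, not Apertium dictionaries.
SPA_ADJECTIVES = {"húngaro", "húngaras", "alemán", "pluvial", "estatista", "ruso"}
SPA_CAT_ADJ = {
    "húngaro": "hongarès",
    "húngaras": "hongareses",
    "alemán": "alemany",
    "pluvial": "pluvial",
    "estatista": "estatista",
}
PREFIXES = {"anti": "anti"}  # Spanish prefix -> Catalan prefix


def split_prefixed(word):
    """Return (prefix, adjective) if word is prefix + known adjective, else None."""
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if rest in SPA_ADJECTIVES:
            return prefix, rest
        # Spanish spelling change: anti + ruso -> antirruso (initial r doubled).
        if rest.startswith("rr") and rest[1:] in SPA_ADJECTIVES:
            return prefix, rest[1:]
    return None


def translate_prefixed(word):
    """Translate the adjective and re-attach the (invariant) prefix."""
    analysis = split_prefixed(word)
    if analysis is None:
        return None
    prefix, adjective = analysis
    target = SPA_CAT_ADJ.get(adjective)
    if target is None:
        return None  # adjective is known but has no translation in the toy data
    return PREFIXES[prefix] + target


print(translate_prefixed("antihúngaras"))  # antihongareses
print(split_prefixed("antirruso"))         # ('anti', 'ruso')
```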

Jaume Ortolà