Jaume Ortolà i Font <jaumeort...@gmail.com> wrote:
> Hi,
>
> I would like to be able to translate automatically certain words formed by
> "a certain prefix + a certain POS" without having to add new entries to the
> dictionaries. For example, any word formed by "anti" + any valid adjective
> in translations spa<>cat:
>
> antihúngaro <> antihongarès
> antihúngaras <> antihongareses
> antialemán <> antialemany
> antipluvial <> antipluvial
> antiestatista <> antiestatista
> ...
>
> The word forms and the POS tags would remain unchanged. (But in some
> languages some spelling changes may be necessary. In Spanish: "anti + ruso"
> becomes antirruso.)
>
> This feature could be used in a lot of language pairs. Has it been
> implemented anywhere? How could it be done?

You could have a <par n="prefixes"> prepended to every <e>:

    <pardef n="prefixes">
      <e><i>anti</i></e>
      <e><i/></e>
    </pardef>

    <e lm="alemán"><par n="prefixes"/><i>alemán</i><par n="foo__n"/></e>

That would be similar to what people do with HFST.

-----

In nno-nob I use the compounding feature of lttoolbox instead. The
relevant parts of the pardefs:

    <pardef n="cp-L">
      <e r="RL"><p><l></l> <r><s n="cmp"/></r></p></e>
      <e r="LR"><p><l></l> <r><s n="cmp"/><s n="compound-only-L"/></r></p></e>
    </pardef>
    <pardef n="ned\only-L__n">
      <e> <p><l></l> <r><s n="n"/><s n="sp"/></r></p><par n="cp-L"/></e>
    </pardef>

    <pardef n="cp-R">
      <e> <p><l></l> <r></r></p></e>
      <e r="LR"><p><l></l> <r><s n="compound-R"/></r></p></e>
    </pardef>
    <pardef n="ep__n">
      <e> <p><l></l> <r><s n="n"/><s n="sg"/></r></p><par n="cp-R"/></e>
    </pardef>

    <e lm="anti"> <i>anti</i><par n="ned\only-L__n"/></e>
    <e lm="alemán"> <i>alemán</i><par n="ep__n"/></e>

Then "anti" alone doesn't get an analysis (compound-only-L can only give
an analysis in compounds), but it can be analysed as a prefix, if you use
lt-proc with the -e argument:

    ^anti<n><sp><cmp>+alemán<n><sg>$

Pretransfer turns this into two lu's:

    ^anti<n><sp><cmp>$ ^alemán<n><sg>$

The tags <compound-only-L> and <compound-R> are "special": a compound
analysis can be made of one or more L's followed by an R. The tags are
hidden from the output when you use lt-proc -e.

The downside to this method is that every right-hand-side needs the tag
<compound-R> on it, so if you had

    <pardef n="ep__n">
      <e> <p><l></l> <r><s n="adj"/><s n="sg"/></r></p></e>
    </pardef>

that needs to be

    <pardef n="ep__n">
      <e> <p><l></l> <r><s n="adj"/><s n="sg"/></r></p><par n="cp-R"/></e>
    </pardef>

etc. You will also need transfer rules to remove the space added by
pretransfer, and chunk it etc.

The upside is that you can combine words without listing everything
twice.

If you've only got one prefix, the HFST-like method is probably better.
If you're combining lots, compounding may be worth considering.
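
To make the pretransfer step above a bit more concrete, here is a minimal
Python sketch of what happens to the compound analysis. It is only a toy
illustration, not the actual apertium-pretransfer code: the function name
split_compound_units is made up for this example, and it assumes that
lexical units are written as ^...$ and that a "+" inside a unit always
joins compound parts (it does not handle escaped characters or multiwords
with a "#" tail).

```python
import re

# Hypothetical helper for illustration; matches one lexical unit: ^...$
LU_PATTERN = re.compile(r"\^([^$]*)\$")


def split_compound_units(stream):
    """Split every '+'-joined lexical unit into separate units.

    The resulting units are joined by a single space, which is the
    space that later transfer rules would have to remove again.
    """
    def split_one(match):
        # The text between ^ and $, e.g. "anti<n><sp><cmp>+alemán<n><sg>"
        parts = match.group(1).split("+")
        return " ".join("^" + part + "$" for part in parts)

    return LU_PATTERN.sub(split_one, stream)


if __name__ == "__main__":
    print(split_compound_units("^anti<n><sp><cmp>+alemán<n><sg>$"))
    # prints: ^anti<n><sp><cmp>$ ^alemán<n><sg>$
```

A unit without "+" passes through unchanged, so ordinary analyses such as
^alemán<n><sg>$ are not affected by the split.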
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff