Jaume Ortolà i Font
<jaumeort...@gmail.com> wrote:

> Hi,
>
> I would like to be able to translate automatically certain words formed by
> "a certain prefix + a certain POS" without having to add new entries to the
> dictionaries. For example, any word formed by "anti" + any valid adjective
> in translations spa<>cat:
>
> antihúngaro <> antihongarès
> antihúngaras <> antihongareses
> antialemán <> antialemany
> antipluvial <> antipluvial
> antiestatista <> antiestatista
> ...
>
> The word forms and the POS tags would remain unchanged. (But in some
> languages some spelling changes may be necessary. In Spanish: "anti + ruso"
> becomes antirruso.)
>
> This feature could be used in a lot of language pairs. Has it been
> implemented anywhere? How could it be done?

You could have a <par n="prefixes"/> prepended to every <e>:

<pardef n="prefixes">
  <e><i>anti</i></e>
  <e><i/></e>
</pardef>
<e lm="alemán"><par n="prefixes"/><i>alemán</i><par n="foo__n"/></e>

That would be similar to what people do with HFST.
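
As a rough illustration of what that optional-prefix paradigm amounts to,
here is a small Python sketch. It is not lttoolbox code: PREFIXES and
expand() are hypothetical names that only mimic the two entries of the
"prefixes" pardef ("anti" or nothing). Since <i> puts the same string on
both sides of an entry, the prefix would show up in the surface form and
in the lemma alike.

```python
# Hypothetical stand-in for <pardef n="prefixes">: "anti" or the empty prefix.
PREFIXES = ["anti", ""]


def expand(stem, prefixes=PREFIXES):
    """Return every form one entry would accept once the prefix paradigm is prepended."""
    return [prefix + stem for prefix in prefixes]


if __name__ == "__main__":
    print(expand("alemán"))  # ['antialemán', 'alemán']
```

Each stem is still listed once; the number of accepted forms grows with
the number of prefixes in the paradigm.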

-----

In nno-nob I use the compounding feature of lttoolbox instead. The
relevant parts of the pardefs:


<pardef n="cp-L">
  <e r="RL"><p><l></l>            <r><s n="cmp"/></r></p></e>
  <e r="LR"><p><l></l>            <r><s n="cmp"/><s n="compound-only-L"/></r></p></e>
</pardef>
<pardef n="ned\only-L__n">
  <e>       <p><l></l>          <r><s n="n"/><s n="sp"/></r></p><par n="cp-L"/></e>
</pardef>

<pardef n="cp-R">
  <e>       <p><l></l>            <r></r></p></e>
  <e r="LR"><p><l></l>            <r><s n="compound-R"/></r></p></e>
</pardef>
<pardef n="ep__n">
  <e>       <p><l></l>    <r><s n="n"/><s n="sg"/></r></p><par n="cp-R"/></e>
</pardef>

<e lm="anti">      <i>anti</i><par n="ned\only-L__n"/></e>
<e lm="alemán">    <i>alemán</i><par n="ep__n"/></e>


With these entries, "anti" on its own gets no analysis (compound-only-L
can only yield an analysis inside a compound), but it can be analysed as
a prefix if you run lt-proc with the -e option:
^anti<n><sp><cmp>+alemán<n><sg>$

Pretransfer then splits this into two lexical units:

^anti<n><sp><cmp>$ ^alemán<n><sg>$
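
The split itself can be pictured with a few lines of Python. This is
only a simplified sketch, not the pretransfer implementation:
split_compound() is a hypothetical helper, and it ignores everything
else a real stream can contain (escaped characters, multiword entries,
text between lexical units).

```python
def split_compound(lu):
    """Split one '^a<t>+b<t>$' unit at each '+' into separate '^...$' units."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^...$")
    body = lu[1:-1]  # drop the leading ^ and the trailing $
    return ["^" + part + "$" for part in body.split("+")]


if __name__ == "__main__":
    units = split_compound("^anti<n><sp><cmp>+alemán<n><sg>$")
    print(" ".join(units))  # ^anti<n><sp><cmp>$ ^alemán<n><sg>$
```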

The tags <compound-only-L> and <compound-R> are "special" – a compound
analysis can be made of one or more L's followed by an R. The tags are
hidden from the output when you use lt-proc -e.
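
To make the "one or more L's followed by an R" rule concrete, here is a
Python sketch of that decomposition over plain strings. It is an
assumption-laden model, not how lt-proc works internally: decompose(),
lefts and rights are hypothetical names, the sets stand in for the forms
tagged <compound-only-L> and <compound-R>, and any limits or preferences
lt-proc applies when choosing between analyses are left out.

```python
def decompose(word, lefts, rights):
    """Return every split of word into one or more left parts followed by one right part."""
    results = []

    def walk(rest, parts):
        # Try every proper prefix of the remaining string as a left part.
        for i in range(1, len(rest)):
            head, tail = rest[:i], rest[i:]
            if head in lefts:
                if tail in rights:
                    results.append(parts + [head, tail])
                # The tail may itself start with another left part.
                walk(tail, parts + [head])

    walk(word, [])
    return results


if __name__ == "__main__":
    lefts, rights = {"anti"}, {"alemán"}
    print(decompose("antialemán", lefts, rights))  # [['anti', 'alemán']]
    print(decompose("anti", lefts, rights))        # []
```

In this model a bare "anti" yields nothing, which mirrors the point that
a compound-only-L form has no analysis of its own.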


The downside to this method is that every right-hand-side needs the tag
<compound-R> on it, so if you had

<pardef n="ep__n">
  <e>       <p><l></l>    <r><s n="adj"/><s n="sg"/></r></p></e>
</pardef>

that needs to be

<pardef n="ep__n">
  <e>       <p><l></l>    <r><s n="adj"/><s n="sg"/></r></p><par n="cp-R"/></e>
</pardef>

etc.

You will also need transfer rules to remove the space added by
pretransfer, chunk the two parts together, and so on.

The upside is that you can combine words without listing everything
twice. If you've only got one prefix, the HFST-like method is probably
better. If you're combining lots, compounding may be worth considering.


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
