"Mikel L. Forcada" <[email protected]> wrote:

> Dear Apertiumers,
>
> my friend Àngel Calpe (copied) is trying to lemmatize pre-normative 
> Valencian texts using a modified version of apertium-cat (morphological 
> analyser and tagger). The texts are written in an XML application called 
> TEI [1].
>
> He would like the words analysed (in particular those marked in 
> <emph>...</emph>) to be wrapped in an element that has the lemma as an 
> attribute (I am not sure, they could be attributes of <emph> or enclosed 
> in an additional element, but that is probably a detail).
>
> Can you think of an easy hack that could be used to do this? Do we have 
> anything that we could repurpose for that?

The quickest hack would probably be to wrap anything to be ignored (e.g.
all of <teiHeader>) in <apertium-notrans>, then call apertium-deshtml on
the rest and translate up until the tagger, as a first stab:

awk 'BEGIN{print "<apertium-notrans>"} /<text>/{print "</apertium-notrans>"} {print}' | \
  apertium-deshtml | \
  lt-proc -w cat.automorf.bin | \
  apertium-tagger -g -p cat.prob
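
If awk is not to hand, the wrapping step can also be sketched in Python.
This is only an illustration of what the awk one-liner above does, not
part of Apertium: the name wrap_header is made up, it assumes <text>
opens on a line of its own without attributes, and unlike the awk
version it closes the wrapper only once.

```python
def wrap_header(lines, marker="<text>"):
    """Wrap every line before the first line containing `marker`
    in <apertium-notrans>, so that apertium-deshtml leaves it alone.

    Hypothetical helper; `marker` is matched as a plain substring.
    """
    yield "<apertium-notrans>\n"
    closed = False
    for line in lines:
        if not closed and marker in line:
            # Close the ignored region just before the body starts.
            yield "</apertium-notrans>\n"
            closed = True
        yield line
    if not closed:
        # No body found: keep the wrapper balanced anyway.
        yield "</apertium-notrans>\n"


# Small demonstration on an invented TEI-like fragment.
sample = ["<TEI>\n", "<teiHeader>...</teiHeader>\n", "<text>\n", "<p>foos</p>\n"]
print("".join(wrap_header(sample)), end="")
```

The output of wrap_header would then be piped into apertium-deshtml
exactly as in the pipeline above.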


then reformat the ^foos/foo<n><pl>$ into <elt a="foo">foos</elt> using
sed or streamparser.py or whatever you prefer, for example:

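python3 -c '
from streamparser import *
import sys
# Each item is (text before the lexical unit, the lexical unit itself).
for blank, lu in parse_file(sys.stdin, withText=True):
  if lu.knownness == known:
    # Join the lemmas of all remaining readings (and subreadings) with "/".
    lms = "/".join(s.baseform for r in lu.readings for s in r)
    word = "<elt a=\""+lms+"\">"+lu.wordform+"</elt>"
  else:
    # Unknown words are passed through unchanged.
    word = lu.wordform
  print(blank+word, end="")
'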

See
https://gist.github.com/unhammer/e4e54b45ffb99baa4608fb49e3accb56
for a full example using en-es.
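
For the "sed" route, here is a rough sketch of the same reformatting
with only the Python standard library. It is an approximation rather
than a real stream-format parser: it assumes the ^surface/analysis$
shape that apertium-tagger -p prints, ignores backslash-escaped
characters, and the names LU, TAGS and wrap_lemmas are invented for
the example. The element and attribute names follow the
<elt a="foo"> example above and can be changed.

```python
import re
from xml.sax.saxutils import quoteattr

# One lexical unit: ^surface/analysis$ (escaped characters not handled).
LU = re.compile(r"\^([^/^$]*)/([^$]*)\$")
# Morphological tags such as <n> or <pl>.
TAGS = re.compile(r"<[^>]*>")


def wrap_lemmas(stream, element="elt", attr="a"):
    """Replace each known lexical unit with <element attr="lemma">surface</element>."""
    def repl(match):
        surface, analysis = match.group(1), match.group(2)
        if analysis.startswith("*"):
            # Unknown word: keep the surface form only.
            return surface
        # Dropping the tags leaves the lemma (or lemma+lemma for joined units).
        lemma = TAGS.sub("", analysis)
        return "<%s %s=%s>%s</%s>" % (element, attr, quoteattr(lemma), surface, element)
    return LU.sub(repl, stream)


print(wrap_lemmas("^foos/foo<n><pl>$ ^xyzzy/*xyzzy$"))
# prints: <elt a="foo">foos</elt> xyzzy
```

quoteattr is used so that a lemma containing a quote or an ampersand
still yields a well-formed attribute; the surface form is passed
through as it comes.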


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
