"Mikel L. Forcada" <[email protected]> čálii: > Dear Apertiumers, > > my friend Àngel Calpe (copied) is trying to lemmatize pre-normative > Valencian texts using a modified version of apertium-cat (morphological > analyser and tagger). The texts are written in an XML application called > TEI [1]. > > He would like the words analysed (in particular those marked in > <emph>...</emph>) to be wrapped in an element that has the lemma as an > attribute (I am not sure, they could be attributes of <emph> or enclosed > in an additional element, but that is probably a detail). > > Can you think of an easy hack that could be used to do this? Do we have > anything that we could repurpose for that?
The quickest hack would probably be to wrap anything to be ignored (e.g.
all of <teiHeader>) in <apertium-notrans>, then call apertium-deshtml on
the rest and translate up until the tagger, as a first stab:

awk 'BEGIN{print "<apertium-notrans>"} /<text>/{print "</apertium-notrans>"} {print}' | \
  apertium-deshtml | \
  lt-proc -w cat.automorf.bin | \
  apertium-tagger -g -p cat.prob

then reformat the ^foos/foo<n><pl>$ into <elt a="foo">foos</elt> using
sed or streamparser.py or whatever you prefer, for example:
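
python3 -c '
from streamparser import *
import sys
# withText=True also yields the text (blanks) preceding each lexical unit
for blank, lu in parse_file(sys.stdin, withText=True):
    if lu.knownness == known:
        # join the lemmas of all subreadings with "/"
        lms = "/".join(s.baseform for r in lu.readings for s in r)
        word = "<tag a=\""+lms+"\">"+lu.wordform+"</tag>"
    else:
        # unknown word: print the surface form unchanged
        word = lu.wordform
    print(blank+word, end="")
'
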
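If installing streamparser is not an option, the same reformatting can be roughly approximated with the standard library only. The following is a minimal sketch, not a tested part of the pipeline: the function names (lemmas, wrap, annotate) are made up for illustration, and the <w lemma="..."> output is just one possible choice of element and attribute. It assumes tagger output of the form ^surface/lemma<tags>$, marks unknown words by the leading * in the analysis, and leaves the [...] superblanks produced by apertium-deshtml untouched, so the result would still need to go through apertium-rehtml. It does not unescape backslash-escaped characters.

```python
import re
from xml.sax.saxutils import quoteattr

# One lexical unit: ^surface/analysis$ (backslash-escaped characters allowed inside).
LU = re.compile(r"\^((?:\\.|[^\\/$^])*)/((?:\\.|[^\\$^])*)\$")
# A morphological tag such as <n> or <pl>.
TAGS = re.compile(r"<[^<>]*>")


def lemmas(analysis):
    """Return the lemmas in one analysis, e.g. 'foo<n><pl>' -> ['foo']."""
    # Drop the tags, then split joined subreadings (decir<vblex>+lo<prn>) on "+".
    parts = TAGS.sub("", analysis).split("+")
    return [p for p in parts if p]


def wrap(match):
    """Turn one matched lexical unit into an element carrying its lemma."""
    surface, analysis = match.group(1), match.group(2)
    if analysis.startswith("*"):
        # Unknown word: keep the bare surface form.
        return surface
    lemma_attr = quoteattr("/".join(lemmas(analysis)))
    return "<w lemma=%s>%s</w>" % (lemma_attr, surface)


def annotate(stream):
    """Replace every lexical unit in an Apertium stream with a wrapped word."""
    return LU.sub(wrap, stream)


if __name__ == "__main__":
    import sys
    sys.stdout.write(annotate(sys.stdin.read()))
```

Saved under some name of your choosing, it would take the place of the python3 -c step at the end of the pipeline, reading the tagger output on stdin.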
See
https://gist.github.com/unhammer/e4e54b45ffb99baa4608fb49e3accb56
for a full example using en-es
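
The awk step that protects the header can also be written in Python, if you would rather keep everything in one script. This is only a sketch under assumptions: protect_header is a hypothetical name, and unlike the awk one-liner it closes <apertium-notrans> only once (at the first line containing the marker) and also closes it at end of input if the marker never appears. The default marker is the literal "<text>" used in the awk command; a TEI file whose <text> element carries attributes would need a different marker.

```python
import sys


def protect_header(lines, marker="<text>"):
    """Yield the input lines, with everything before the first line
    containing `marker` wrapped in <apertium-notrans>...</apertium-notrans>."""
    yield "<apertium-notrans>\n"
    closed = False
    for line in lines:
        if not closed and marker in line:
            # The body starts here: stop protecting from this line on.
            yield "</apertium-notrans>\n"
            closed = True
        yield line
    if not closed:
        # No marker found: keep the output balanced anyway.
        yield "</apertium-notrans>\n"


if __name__ == "__main__":
    sys.stdout.writelines(protect_header(sys.stdin))
```

Its output would be piped into apertium-deshtml exactly as the awk output is.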
_______________________________________________ Apertium-stuff mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/apertium-stuff
