Hi,
We attempted to create all the files needed for supervised training.
Going by the usage info printed by apertium-tagger, we assumed that the
input files should be as follows:
apertium-tagger[-d] -s=n DIC CRP TSX TAGGER_DATA HTAG UNTAG
DIC: full expanded dictionary file
CRP: training text corpus file
TSX: tagger specification file, in XML format
TAGGER_DATA: tagger data file, built in the training and used while
tagging
HTAG: hand-tagged text corpus
UNTAG: untagged text corpus, morphological analysis of HTAG
corpus to use both jointly with -s option
INPUT: input file, stdin by default
OUTPUT: output file, stdout by default
DIC was created using lt-comp lr from a .dix file we generated from the
output of our morphological analyzer. This should be OK; lt-comp seemed to
like it.
CRP: we assumed that this should be just a tokenized plain text file (just
word forms of the training corpus). We escaped special characters like
[]{}^$ and \. The file does contain newlines, though.
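For reference, the escaping step can be sketched roughly as below. This is only an illustration: the character set is the one listed above, which may not cover every reserved character of the stream format, and the function name escape_text is made up for this example.

```python
# Characters escaped in the corpus, as listed in the message above.
# Assumption: this set may be incomplete for the Apertium stream format.
SPECIAL_CHARS = set("[]{}^$\\")


def escape_text(text: str) -> str:
    """Prefix every special character with a backslash; keep the rest as-is."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    # Whitespace and newlines pass through unchanged.
    print(escape_text("cost [approx.] $5\nnext line"))
```

Newlines are left untouched here, which matches the file as described.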
TSX is an XML file listing our tags.
It looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="Hungarian">
<tagset>
<def-label name="DET">
<tags-item tags="DET"/>
</def-label>
<def-label name="DET|NM">
<tags-item tags="DET|NM"/>
</def-label>
<def-label name="ELO">
<tags-item tags="ELO"/>
</def-label>
<def-label name="FF.FN.PSe3.ACC">
<tags-item tags="FF.FN.PSe3.ACC"/>
....
<def-label name="SZN|NM._MUL">
<tags-item tags="SZN|NM._MUL"/>
</def-label>
<def-label name="X">
<tags-item tags="X"/>
</def-label>
</tagset>
</tagger>
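A file with this layout can be generated from a plain list of tag names. The sketch below simply reproduces the structure shown above (one def-label per tag, each with a single tags-item); it makes no claim that this is the layout the tagger actually expects, and build_tsx is a name invented for the example.

```python
import xml.etree.ElementTree as ET


def build_tsx(tagger_name, tags):
    """Return TSX text with one def-label per tag, mirroring the excerpt above."""
    root = ET.Element("tagger", {"name": tagger_name})
    tagset = ET.SubElement(root, "tagset")
    for tag in tags:
        label = ET.SubElement(tagset, "def-label", {"name": tag})
        ET.SubElement(label, "tags-item", {"tags": tag})
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    print(build_tsx("Hungarian", ["DET", "DET|NM", "ELO", "X"]))
```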
We expect TAGGER_DATA to be created by the tagger training process.
HTAG: We assumed that this should be the training corpus again but this
time in apertium stream format and that it should contain exactly one
analysis for each token:
^Ezt/ez<FN|NM><ACC>$ ^pedig/pedig<KOT>$ ^a/a<DET>$
^németekkel/német<MN><PL><INS>$ ^való/való<MN><NOM>$
^ilyesfajta/ilyesfajta<MN|NM><NOM>$ ^együttműködés/együttműködés<FN><NOM>$
UNTAG: from the usage info we assumed that this should also be the training
corpus in apertium stream format but this time it should contain all
analyses returned by the morphological analyzer for each token:
^Ezt/ez<FN|NM><ACC>$ ^pedig/pedig<KOT>$ ^a/a<DET>/a<FN><NOM>/a<FN|NM>$
^németekkel/német<MN><PL><INS>$ ^való/való<FN><NOM>/való<MN><NOM>$
^ilyesfajta/ilyesfajta<MN|NM><NOM>$ ^együttműködés/együttműködés<FN><NOM>$
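Since the two files are supposed to describe the same tokens, a quick consistency check may help rule out misalignment as the cause of a problem. The following is a rough sketch under our own assumptions: units are delimited by unescaped ^ and $, analyses are separated by unescaped /, and HTAG must contain exactly one analysis per token that also occurs in UNTAG. The function names are invented for the example, and the regular expressions do not handle an escaped backslash directly before a delimiter.

```python
import re

# A unit runs from an unescaped '^' to the next unescaped '$'.
UNIT_RE = re.compile(r"(?<!\\)\^(.*?)(?<!\\)\$", re.DOTALL)
# Surface form and analyses are separated by unescaped '/'.
SLASH_RE = re.compile(r"(?<!\\)/")


def parse_units(text):
    """Return a (surface form, [analyses]) pair for every ^...$ unit in text."""
    units = []
    for match in UNIT_RE.finditer(text):
        surface, *analyses = SLASH_RE.split(match.group(1))
        units.append((surface, analyses))
    return units


def check_alignment(htag_text, untag_text):
    """Return a list of problems found when comparing HTAG with UNTAG."""
    htag, untag = parse_units(htag_text), parse_units(untag_text)
    problems = []
    if len(htag) != len(untag):
        problems.append(f"token count differs: {len(htag)} vs {len(untag)}")
    # zip stops at the shorter file, so only the common prefix is compared.
    for i, ((h_surf, h_an), (u_surf, u_an)) in enumerate(zip(htag, untag)):
        if h_surf != u_surf:
            problems.append(f"token {i}: surface {h_surf!r} vs {u_surf!r}")
        if len(h_an) != 1:
            problems.append(f"token {i}: HTAG has {len(h_an)} analyses")
        elif h_an[0] not in u_an:
            problems.append(f"token {i}: {h_an[0]!r} missing from UNTAG")
    return problems


if __name__ == "__main__":
    htag = "^pedig/pedig<KOT>$ ^a/a<DET>$"
    untag = "^pedig/pedig<KOT>$ ^a/a<DET>/a<FN><NOM>/a<FN|NM>$"
    print(check_alignment(htag, untag))
```

An empty list means that the two files agree on these points; it says nothing about whether the tagger accepts them.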
The snag is that when we try to run supervised training using these input
files, apertium-tagger outputs:
Calculating ambiguity classes...
Then it dies after eating up all memory.
Can you suggest something?
Thanks,
Attila, Gyorgy
On Mon, Apr 2, 2012 at 14:25, Jimmy O'Regan <[email protected]> wrote:
> 2012/4/2 Orosz György <[email protected]>:
> > We are one step closer now, but just wondering if there is any easy
> > way to create a .dix file from Apertium stream format. (or any easy way
> > to use an analysed text file for the tagger, instead of enumerating
> > lemmata and paradigms.)
>
> That usually involves much messing around with sed and the like, so
> here's a little program for that: https://gist.github.com/2283105
>
> g++ apertium-cleanstream.cc -o apertium-cleanstream
>
> cat $YOURTEXT | ./apertium-cleanstream -n|sort|uniq
>
> should give you what you need.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
>
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>