Hello all,
It looks like the Brill's tagger encoding problem was not actually a problem. The generated file did display incorrectly in nano, gedit, etc., but the tagger scripts and executables had no trouble using the correct encoding to do their job. I did have to make a few changes to some source files to get the tagger to work, but it now produces nice results (accuracy is much higher than what TreeTagger achieves on the same corpus with the same tagset).

Regards,
Panos

_____

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Panos
Sent: Tuesday, November 11, 2008 1:26 AM
To: [email protected]
Subject: [Moses-support] Brill's tagger and some questions about factored models

Hi all,

I would like some help from anyone with experience in Brill's POS tagger. I have some encoding problems in the output file after I run the rule learner (the unknown-lexical-learn.prl script). All the files generated up to this point are in UTF-8 encoding and appear fine, but the output file created by the rule learner cannot display Greek characters correctly.

Hoping that this issue will be solved and that I will finally manage to create a Greek POS tagger, I was wondering whether tagging the English source text too and adding more generation/translation steps would produce better results.

Also, the Greek POS tagset I use is the one provided by Xerox. In that case, I suppose I will have to use the same tagset for tagging the English text too, right? I am not very happy with Xerox's tagset because it is fairly limited, and I think Brill's English tagger uses the Penn Treebank tagset, so I am a little confused. Should I just use the Penn tagset for tagging the Greek text too? (I have only tagged 2000 words so far, and I could modify the tags accordingly without too much effort.)

Thanks,
Panos
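
For anyone who runs into the same display issue: a quick way to tell a real encoding problem from an editor misdetecting the encoding is to test the raw bytes directly. The following is only a minimal sketch, not part of the tagger. The function names are hypothetical, and ISO-8859-7 is included purely as an assumed alternative Greek encoding; the thread does not say which encoding the editors were guessing.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without error under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def check_file(path: str, candidates=("utf-8", "iso-8859-7")) -> dict:
    """Report, for each candidate encoding, whether the whole file decodes cleanly.

    The candidate list is an assumption for illustration; adjust it as needed.
    """
    with open(path, "rb") as f:  # read raw bytes, no implicit decoding
        data = f.read()
    return {enc: decodes_as(data, enc) for enc in candidates}
```

If `check_file` reports that the rule learner's output decodes cleanly as UTF-8, the bytes are most likely fine and the garbled display comes from the editor. Note that a clean decode under a single-byte encoding such as ISO-8859-7 proves little, since almost any byte sequence decodes under it.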
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
