Hello all,

 

It looks like the encoding problem with Brill's tagger was not actually a
problem. The generated file did display incorrectly in nano, gedit, etc.,
but the tagger scripts and executables had no trouble reading it with the
correct encoding. I did have to make a few changes to some source files to
get the tagger working, but it now produces good results: accuracy is much
higher than what I obtained from TreeTagger on the same corpus with the
same tagset.
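
In case it helps anyone who sees the same display issue: one way to tell a
real encoding problem from an editor guessing wrong is to check the bytes
directly. Below is a minimal Python sketch, not part of the tagger. The
function name and the file name in the usage comment are hypothetical. It
only checks that the data decodes as UTF-8 and contains at least one Greek
letter.

```python
import unicodedata


def looks_like_utf8_greek(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains a Greek letter."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8 (e.g. they may be ISO-8859-7).
        return False
    # unicodedata.name() returns "" (our default) for unnamed characters.
    return any("GREEK" in unicodedata.name(ch, "") for ch in text)


# Usage (hypothetical file name; read in binary mode so nothing is decoded
# implicitly):
#
#     with open("rule_learner_output.txt", "rb") as f:
#         print(looks_like_utf8_greek(f.read()))
```

If this returns True while the editor still shows garbage, the file itself
is probably fine and the editor is simply picking the wrong encoding.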

 

Regards,

Panos

  _____  

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Panos
Sent: Tuesday, November 11, 2008 1:26 AM
To: [email protected]
Subject: [Moses-support] Brill's tagger and some questions about factored models

 

Hi all,

 

I would like some help from anyone with experience with Brill's POS tagger.
I have an encoding problem in the output file produced by the rule learner
(the unknown-lexical-learn.prl script). All the files generated up to that
point are in UTF-8 and display fine, but the output file created by the
rule learner does not display Greek characters correctly.

 

Hoping that this issue will be solved and that I will finally manage to
build a Greek POS tagger, I was wondering whether tagging the English source
text as well and adding more generation/translation steps would produce
better results. Also, the Greek POS tagset I use is the one provided by
Xerox. In that case, I suppose I will have to use the same tagset for the
English text too, right? I am not very happy with Xerox's tagset because it
is fairly limited, and I think Brill's English tagger uses the Penn Treebank
tagset, so I am a little confused. Should I just use the Penn tagset for the
Greek text too? (I have only tagged 2000 words so far, so I could modify the
tags accordingly without too much effort.)

 

Thanks

 

Panos

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
