On 2014-09-03 06:22, Dominique Pellé wrote:
> Hi
>
> Have a look at the following debug output
> of LanguageTool, where a token gets the nonsensical
> POS tag "N.*" (multiple times) after a disambiguation
> rule is applied.
>
> Is it a bug in the disambiguator?
> Or am I writing an incorrect disambiguation rule?
>
> $ echo "An eil" | java -jar languagetool-standalone/target/LanguageTool-2.7-SNAPSHOT/LanguageTool-2.7-SNAPSHOT/languagetool-commandline.jar -c utf-8 -l br -v
> Expected text language: Breton
> Working on STDIN...
> 664 rules activated for language Breton
> <S> An[mont/V pres 1 s,monet/V pres 1 s,an/D e sp,] eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,</S>,]<P/>
> Disambiguator log:
>
> UR_N:2 eil[eilañ/V pres 3 s,eilañ/V impe 2 s,eil/K e sp o,eil/J,eilañ/SENT_END] -> eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/SENT_END]
>
> Notice that the token "eil" gets the POS tag "N.*" (which
> is an invalid POS tag, it's not meant to be a regexp) and,
> furthermore, it gets that same POS tag 5 times after
> disambiguation.
>
> The disambiguation rule UR_N:2 in
> languagetool-language-modules/br/src/main/resources/org/languagetool/resource/br/disambiguation.xml
> is...
>
>   <rule>
>     <pattern>
>       <token regexp="yes">u[ln]|a[nlr]</token>
>       <marker>
>         <token postag="V.*" postag_regexp="yes"/>
>       </marker>
>     </pattern>
>     <disambig action="filter" postag="N.*"/>
>   </rule>
>
> The idea of the disambiguation rule is that, if the
> word following "an" (or al, or ar, etc.) is a verb (V.*),
> then keep only its noun POS tag (N.*)
> in case it happens to be also a noun.
> But obviously, this is not what's happening here.

Actually, this is not, strictly speaking, a bug. What happens is this:
when you apply the filter action to a token that has no reading with the
requested tag at all, the action simply replaces all existing tags with
the POS tag you specified. I think it was convenient for some purposes,
or I didn't mind (I don't remember).

The easiest way to prevent this is to add a condition to the pattern so
that the rule only matches tokens that have both a verb reading and a
noun reading, for example:

  <marker>
    <and>
      <token postag="V.*" postag_regexp="yes"/>
      <token postag="N.*" postag_regexp="yes"/>
    </and>
  </marker>

This will stop "eil" from getting wrong tags.

We could, in principle, try to add this kind of test to the disambiguator
action itself, but I'm not sure it wouldn't break something.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
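
For reference, here is how the UR_N:2 rule from the original message might look with the suggested condition merged in. This is only a sketch that combines the two fragments quoted in this thread; it has not been run against the Breton test sentences, and the surrounding rule group and any attributes on the rule element are assumed to stay as they are in disambiguation.xml.

```xml
<!-- Sketch only: the original rule plus the extra <and> condition.
     Untested; the enclosing rule group is assumed to be unchanged. -->
<rule>
  <pattern>
    <token regexp="yes">u[ln]|a[nlr]</token>
    <marker>
      <!-- Match only tokens that have BOTH a verb and a noun reading,
           so the filter always has at least one N.* reading to keep. -->
      <and>
        <token postag="V.*" postag_regexp="yes"/>
        <token postag="N.*" postag_regexp="yes"/>
      </and>
    </marker>
  </pattern>
  <disambig action="filter" postag="N.*"/>
</rule>
```

With this condition, a token such as "eil" in "An eil", which has V, K and J readings but no N reading, should no longer match the pattern, so its readings would be left untouched.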