Thanks Jaume. I'll apply that fix to this disambiguation rule soon.

However, if it's not too hard to implement, I think it would be better for <disambig action="filter" postag="..."/> to be a no-op when the postag regex does not match any reading of the token. It would make disambiguation rules simpler and less dangerous, since it's easy to make the same mistake as I did.

I suspect that several other disambiguation rules, in several languages, have the same kind of mistake. I only found this one by chance, by looking at the -v debug output of LanguageTool, and I'm not aware of a systematic way to find the other similar mistakes.
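To make the proposal concrete, here is a minimal Python sketch of the filter semantics I have in mind. It is not LanguageTool's actual implementation: the function name filter_readings, the (lemma, postag) tuple representation, and the use of a full regex match on the POS tag are all assumptions made for illustration. The SENT_END reading is left out to keep the example short.

```python
import re
from typing import List, Tuple

# Hypothetical representation of a token reading: (lemma, POS tag).
Reading = Tuple[str, str]


def filter_readings(readings: List[Reading], postag_regex: str) -> List[Reading]:
    """Keep only the readings whose POS tag fully matches postag_regex.

    Proposed behaviour: if no reading matches, return the readings
    unchanged (a no-op) instead of altering the token.
    """
    pattern = re.compile(postag_regex)
    kept = [r for r in readings if pattern.fullmatch(r[1])]
    return kept if kept else list(readings)


# Readings of "eil" before disambiguation, taken from the debug output.
eil = [
    ("eilañ", "V pres 3 s"),
    ("eilañ", "V impe 2 s"),
    ("eil", "K e sp o"),
    ("eil", "J"),
]

# No reading is tagged N.*, so the token is left as it was.
print(filter_readings(eil, "N.*"))

# Two readings are tagged V.*, so only those two are kept.
print(filter_readings(eil, "V.*"))
```

With this behaviour, the original UR_N:2 rule would simply leave "eil" untouched rather than producing the bogus "N.*" tags.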
Regards
Dominique

Jaume Ortolà i Font <jaumeort...@gmail.com> wrote:

> Dominique,
>
> As far as I remember (it is documented somewhere), that is what happens when
> you try to filter a non-existent tag. You try to filter "N.*" but there is
> no N.* tag in the token. In your sentence "eil" is not tagged with N.
>
> You need something like this:
>
> <rule>
>   <pattern>
>     <token regexp="yes">u[ln]|a[nlr]</token>
>     <marker>
>       <and>
>         <token postag="V.*" postag_regexp="yes"/>
>         <token postag="N.*" postag_regexp="yes"/>
>       </and>
>     </marker>
>   </pattern>
>   <disambig action="filter" postag="N.*"/>
> </rule>
>
> Regards,
> Jaume Ortolà
>
> 2014-09-03 6:22 GMT+02:00 Dominique Pellé <dominique.pe...@gmail.com>:
>>
>> Hi
>>
>> Have a look at the following debug output
>> of LanguageTool, where a token gets the nonsensical
>> POS tag "N.*" (multiple times) after a disambiguation
>> rule is applied.
>>
>> Is it a bug in the disambiguator?
>> Or am I writing an incorrect disambiguation rule?
>>
>> $ echo "An eil" | java -jar \
>>     languagetool-standalone/target/LanguageTool-2.7-SNAPSHOT/LanguageTool-2.7-SNAPSHOT/languagetool-commandline.jar \
>>     -c utf-8 -l br -v
>> Expected text language: Breton
>> Working on STDIN...
>> 664 rules activated for language Breton
>> <S> An[mont/V pres 1 s,monet/V pres 1 s,an/D e sp,]
>> eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,</S>,]<P/>
>> Disambiguator log:
>>
>> UR_N:2 eil[eilañ/V pres 3 s,eilañ/V impe 2 s,eil/K e sp o,eil/J,eilañ/SENT_END] ->
>> eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/SENT_END]
>>
>> Notice that the token "eil" gets the POS tag "N.*" (which
>> is an invalid POS tag, it's not meant to be a regexp) and
>> furthermore, it gets that same POS tag 5 times after
>> disambiguation.
>>
>> The disambiguation rule UR_N:2 in
>> languagetool-language-modules/br/src/main/resources/org/languagetool/resource/br/disambiguation.xml
>> is...
>>
>> <rule>
>>   <pattern>
>>     <token regexp="yes">u[ln]|a[nlr]</token>
>>     <marker>
>>       <token postag="V.*" postag_regexp="yes"/>
>>     </marker>
>>   </pattern>
>>   <disambig action="filter" postag="N.*"/>
>> </rule>
>>
>> The idea of the disambiguation rule is that, if the
>> word following "an" (or al, or ar, etc.) is a verb (V.*),
>> then keep only its noun POS tag (N.*)
>> in case it happens to be also a noun.
>> But obviously, this is not what's happening here.
>>
>> Regards
>> Dominique
>>
>> ------------------------------------------------------------------------------
>> Slashdot TV.
>> Video for Nerds. Stuff that matters.
>> http://tv.slashdot.org/
>> _______________________________________________
>> Languagetool-devel mailing list
>> Languagetool-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel