On 2014-09-03 06:22, Dominique Pellé wrote:
> Hi
>
> Have a look in the following debug output
> of LanguageTool where a token gets a nonsensical
> POS tag "N.*" (multiple times) after a disambiguation
> rule is applied.
>
> Is it a bug in the disambiguator?
> Or am I writing an incorrect disambiguation rule?
>
> $ echo "An eil"| java -jar
> languagetool-standalone/target/LanguageTool-2.7-SNAPSHOT/LanguageTool-2.7-SNAPSHOT/languagetool-commandline.jar
> -c utf-8 -l br -v
> Expected text language: Breton
> Working on STDIN...
> 664 rules activated for language Breton
> <S> An[mont/V pres 1 s,monet/V pres 1 s,an/D e sp,]
> eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,</S>,]<P/>
> Disambiguator log:
>
> UR_N:2 eil[eilañ/V pres 3 s,eilañ/V impe 2 s,eil/K e sp
> o,eil/J,eilañ/SENT_END] ->
> eil[eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/N.*,eilañ/SENT_END]
>
>
> Notice that the token "eil" gets POS tag "N.*" (which
> is an invalid POS tag; it's not meant to be a regexp) and
> furthermore, it gets that same POS tag 5 times after
> disambiguation.
>
> The disambiguation rule UR_N:2 in
> languagetool-language-modules/br/src/main/resources/org/languagetool/resource/br/disambiguation.xml
> is...
>
>      <rule>
>        <pattern>
>          <token regexp="yes">u[ln]|a[nlr]</token>
>          <marker>
>            <token postag="V.*" postag_regexp="yes"/>
>          </marker>
>        </pattern>
>        <disambig action="filter" postag="N.*"/>
>      </rule>
>
> The idea of the disambiguation rule is that, if the
> word following "an" (or al, or ar, etc.) is a verb (V.*),
> then keep only its noun POS tag (N.*)
> in case it happens to be also a noun.
> But obviously, this is not what's happening here.

Strictly speaking, this is not a bug. What happens is this: when the 
filter action is applied to a token that has no reading with the 
specified tag at all, it simply replaces all existing tags with the POS 
tag you specified, taken literally (hence the "N.*"). I think this was 
either convenient for some purpose or I just didn't mind at the time 
(I don't remember).

The easiest way to prevent this is to add a further condition to the 
pattern, so that the marked token must also have a noun reading, for 
example:

           <marker>
                <and>
                    <token postag="V.*" postag_regexp="yes"/>
                    <token postag="N.*" postag_regexp="yes"/>
                </and>
           </marker>

With this condition the rule only matches tokens that have both a verb 
and a noun reading, so "eil" will no longer get wrong tags.
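
Put together with your original rule, the whole thing would look roughly 
like this. It is an untested sketch: I am assuming the first token and 
the disambig element stay exactly as in your UR_N:2 rule, and only the 
marker content changes.

     <rule>
       <pattern>
         <token regexp="yes">u[ln]|a[nlr]</token>
         <marker>
           <!-- assumption: both conditions must hold for the same token -->
           <and>
             <token postag="V.*" postag_regexp="yes"/>
             <token postag="N.*" postag_regexp="yes"/>
           </and>
         </marker>
       </pattern>
       <disambig action="filter" postag="N.*"/>
     </rule>

Since "eil" has only V, K and J readings, the <and> should not match it, 
and the filter would then never run on a token without a noun reading.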

We could, in principle, add this kind of check to the filter action 
itself, but I'm not sure it wouldn't break something.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
