incorrect antipattern IDs (bug in XML parser?) + antipattern sanity check

Dominique Pellé Mon, 05 May 2014 15:32:02 -0700

Hi

I've added sanity checks for antipatterns.

They detect some problems in antipatterns for German
and Polish.

However, I have not checked it in yet because
antiPattern.getId() returns the wrong ID. It seems to contain the ID
of the previous rule rather than that of the rule owning the
antipattern.  I believe that the problem is in the SAX XML
parser: </antipattern> is encountered before </pattern>,
and the rule ID is only set when </pattern> is encountered
(I'm not 100% sure that's the root cause).
I have not fixed that.  Maybe Marcin can fix it more quickly
than me (hint...) :-)
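
To illustrate the ordering problem I suspect, here is a minimal standalone
sketch. It is not the actual LanguageTool rule loader: the class name, the
fields, the simplified <rule id="..."> layout and the assignIdAtRuleStart
switch are all made up for the example. It only shows that if the ID is
picked up at </pattern>, an antipattern that precedes the pattern is
labelled with the previous rule's ID.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class AntiPatternIdDemo {

  /** Hypothetical handler, only meant to mimic the suspected event ordering. */
  static class Handler extends DefaultHandler {
    private final boolean assignIdAtRuleStart;
    private String pendingId;  // ID read from the <rule> start tag
    private String currentId;  // ID the handler uses when it builds objects
    final List<String> antiPatternOwners = new ArrayList<>();

    Handler(boolean assignIdAtRuleStart) {
      this.assignIdAtRuleStart = assignIdAtRuleStart;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
      if ("rule".equals(qName)) {
        pendingId = attrs.getValue("id");
        if (assignIdAtRuleStart) {
          currentId = pendingId;  // possible fix: ID is known before any child closes
        }
      }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      if ("antipattern".equals(qName)) {
        // The antipattern is labelled with whatever ID is current right now.
        antiPatternOwners.add(currentId);
      } else if ("pattern".equals(qName)) {
        // Suspected behaviour: the ID only becomes current here, which is
        // too late for antipatterns that were closed earlier in the same rule.
        currentId = pendingId;
      }
    }
  }

  // Two rules; only the second one (B) has an antipattern.
  static final String XML =
      "<rules>"
      + "<rule id=\"A\"><pattern><token>a</token></pattern></rule>"
      + "<rule id=\"B\"><antipattern><token>b</token></antipattern>"
      + "<pattern><token>c</token></pattern></rule>"
      + "</rules>";

  /** Returns the ID that each antipattern in the XML ends up labelled with. */
  static List<String> owners(String xml, boolean assignIdAtRuleStart) throws Exception {
    Handler handler = new Handler(assignIdAtRuleStart);
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.antiPatternOwners;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(owners(XML, false)); // ID set at </pattern>: prints [A]
    System.out.println(owners(XML, true));  // ID set at <rule>: prints [B]
  }
}
```

If the real handler behaves like the first case, setting the ID when the
rule's start tag is seen would be one possible fix, but I have not checked
that against the actual code.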

I also see the same errors reported multiple times. I'm
not sure why yet; I'll investigate before checking in.

Here is my patch for detecting problems in antipatterns:

diff --git a/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java b/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
index 326da86..03cdf7b 100644
--- a/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
+++ b/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
@@ -225,6 +225,13 @@ public class PatternRule extends AbstractPatternRule {
   }

   /**
+   * For testing only.
+   */
+  public final List<DisambiguationPatternRule> getAntiPatterns() {
+    return antiPatterns;
+  }
+
+  /**
    * A fast check whether this rule can be ignored for the given sentence
    * because it can never match. Used internally for performance optimization.
    * @since 2.4
diff --git a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
index 4f0b5e4..1b91087 100644
--- a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
+++ b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
@@ -32,6 +32,8 @@ import org.languagetool.rules.IncorrectExample;
 import org.languagetool.rules.Rule;
 import org.languagetool.rules.RuleMatch;
 import org.languagetool.rules.spelling.SpellingCheckRule;
+import org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRule;
+

 /**
  * @author Daniel Naber
@@ -158,8 +160,16 @@ public class PatternRuleTest extends TestCase {
       rules.addAll(languageTool.loadPatternRules(patternRuleFileName));
     }
     for (PatternRule rule : rules) {
+        // Test the rule pattern.
         PatternTestTools.warnIfRegexpSyntaxNotKosher(rule.getElements(),
                 rule.getId(), rule.getSubId(), lang);
+
+        // Test the rule antipatterns.
+        List<DisambiguationPatternRule> antiPatterns = rule.getAntiPatterns();
+        for (DisambiguationPatternRule antiPattern : antiPatterns) {
+          PatternTestTools.warnIfRegexpSyntaxNotKosher(antiPattern.getElements(),
+              antiPattern.getId(), antiPattern.getSubId(), lang);
+        }
     }
     testGrammarRulesFromXML(rules, languageTool, allRulesLanguageTool, lang);
     System.out.println(rules.size() + " rules tested.");



And here are the problems it finds (keeping in mind that
the reported antipattern IDs are incorrect!):

 The German rule: AUFRECHT_ERHALTEN_antipattern:2:null, token [3], contains
"kaufen" that is marked as regular expression but probably is not one.

Contrary to what the message reports, the bug is in
the 2nd antipattern of the INFINITIV_MIT_ZU rule group, which contains:

 <rulegroup id="INFINITIV_MIT_ZU" name="Zusammen-/Getrenntschreibung: Infinitiv mit 'zu'">
            ....
            <antipattern>
                <!-- 'gibt es nach wie vor zu kaufen' -->
                <token>vor</token>
                <token>zu</token>
                <token regexp="yes">kaufen</token>
            </antipattern>


Other errors:
The Polish rule: null_antipattern:7:null, token [1], contains "" that is
marked as inflected but is empty, so the attribute is redundant.

The Polish rule: BRAK_PRZECINKA_SPOJNIK_PROSTY_antipattern:1:null
(exception in POS tag of token [3]), token [3], contains
"(?:comp|interp):comma|SENT_END" that is not marked as regular expression
but probably is one.

The Polish rule: ZOSTALO_ZROBIONE_antipattern:1:null, token [1], contains
"" that is marked as inflected but is empty, so the attribute is redundant.

The Polish rule: ZOSTALO_ZROBIONE_antipattern:4:null, token [1], contains
"być" that is marked as regular expression but probably is not one.
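
For readers unfamiliar with these warnings, the three kinds reported above
can be approximated by checks like the following. This is only a rough
sketch under my own assumptions, not the actual code of
PatternTestTools.warnIfRegexpSyntaxNotKosher: the class, the method
signature and the "contains a regex metacharacter" heuristic are invented
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TokenSanityDemo {

  // Assumed heuristic: a string "looks like" a regular expression if it
  // contains at least one common regex metacharacter.
  private static final Pattern REGEX_META =
      Pattern.compile("[.*+?|()\\[\\]{}^$\\\\]");

  /** Returns the warnings for one token, given its text and its attributes. */
  static List<String> check(String text, boolean markedRegexp, boolean markedInflected) {
    List<String> warnings = new ArrayList<>();
    boolean looksLikeRegexp = REGEX_META.matcher(text).find();
    if (markedRegexp && !looksLikeRegexp) {
      warnings.add("marked as regular expression but probably is not one");
    }
    if (!markedRegexp && looksLikeRegexp) {
      warnings.add("not marked as regular expression but probably is one");
    }
    if (markedInflected && text.isEmpty()) {
      warnings.add("marked as inflected but is empty");
    }
    return warnings;
  }

  public static void main(String[] args) {
    // <token regexp="yes">kaufen</token>
    System.out.println(check("kaufen", true, false));
    // POS tag exception without regexp marking
    System.out.println(check("(?:comp|interp):comma|SENT_END", false, false));
    // empty token marked as inflected
    System.out.println(check("", false, true));
  }
}
```

A heuristic this crude would also flag a literal "." token, so a real
check needs to be more careful than this.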


Regards
Dominique
