Hi,
I've added sanity checks for antipatterns.
They detect some problems in the antipatterns for German
and Polish.
However, I have not checked them in yet because
antiPattern.getId() is incorrect. It seems to contain the ID
of the previous rule rather than the ID of the rule that owns
the antipattern. I believe the problem is in the SAX XML
parser: </antipattern> is encountered before </pattern>,
and the rule ID is set when encountering </pattern>
(I'm not 100% sure that's the root cause).
I have not fixed that. Maybe Marcin can fix it more quickly
than I can (hint...) :-)
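To make the suspected ordering problem concrete, here is a minimal, self-contained sketch. It is not the LanguageTool rule loader: the class StaleIdDemo, its handler, and the tiny XML input are all made up for illustration. It only assumes what is described above, namely that the ID becomes "current" at </pattern> while the antipattern is closed earlier.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of LanguageTool.
class StaleIdDemo {

  /** Toy handler that only adopts the rule ID once </pattern> is seen. */
  static class Handler extends DefaultHandler {
    private String pendingId;  // ID read from the <rule> start tag
    private String currentId;  // ID the handler treats as "current"
    final List<String> antiPatternOwners = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
      if ("rule".equals(qName)) {
        pendingId = attrs.getValue("id");
      }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      if ("antipattern".equals(qName)) {
        // currentId still belongs to the previous rule at this point.
        antiPatternOwners.add(String.valueOf(currentId));
      } else if ("pattern".equals(qName)) {
        currentId = pendingId;
      }
    }
  }

  /** Parses the XML and returns the ID recorded for each antipattern, in order. */
  static List<String> parse(String xml) throws Exception {
    Handler handler = new Handler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return handler.antiPatternOwners;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<rules>"
        + "<rule id=\"A\"><antipattern/><pattern/></rule>"
        + "<rule id=\"B\"><antipattern/><pattern/></rule>"
        + "</rules>";
    System.out.println(parse(xml)); // prints [null, A]
  }
}
```

In this toy setup the antipattern of rule B is recorded under A, and the very first antipattern gets "null". If the real handler behaves the same way, reading the ID at the <rule> start tag (or assigning it to the antipatterns at </rule>) would avoid the stale value, but that is a guess until the actual parser code has been checked.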
I also see the same errors reported multiple times. I'm
not sure why yet; I'll investigate before checking in.
Here is my patch for detecting problems in antipatterns:
diff --git a/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java b/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
index 326da86..03cdf7b 100644
--- a/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
+++ b/languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRule.java
@@ -225,6 +225,13 @@ public class PatternRule extends AbstractPatternRule {
}
/**
+ * For testing only.
+ */
+ public final List<DisambiguationPatternRule> getAntiPatterns() {
+ return antiPatterns;
+ }
+
+ /**
* A fast check whether this rule can be ignored for the given sentence
* because it can never match. Used internally for performance optimization.
* @since 2.4
diff --git a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
index 4f0b5e4..1b91087 100644
--- a/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
+++ b/languagetool-core/src/test/java/org/languagetool/rules/patterns/PatternRuleTest.java
@@ -32,6 +32,8 @@ import org.languagetool.rules.IncorrectExample;
import org.languagetool.rules.Rule;
import org.languagetool.rules.RuleMatch;
import org.languagetool.rules.spelling.SpellingCheckRule;
+import org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRule;
+
/**
* @author Daniel Naber
@@ -158,8 +160,16 @@ public class PatternRuleTest extends TestCase {
rules.addAll(languageTool.loadPatternRules(patternRuleFileName));
}
for (PatternRule rule : rules) {
+ // Test the rule pattern.
PatternTestTools.warnIfRegexpSyntaxNotKosher(rule.getElements(),
rule.getId(), rule.getSubId(), lang);
+
+ // Test the rule antipatterns.
+ List<DisambiguationPatternRule> antiPatterns = rule.getAntiPatterns();
+ for (DisambiguationPatternRule antiPattern : antiPatterns) {
+ PatternTestTools.warnIfRegexpSyntaxNotKosher(antiPattern.getElements(),
+ antiPattern.getId(), antiPattern.getSubId(), lang);
+ }
}
testGrammarRulesFromXML(rules, languageTool, allRulesLanguageTool, lang);
System.out.println(rules.size() + " rules tested.");
And here are the problems it finds (keep in mind that
the reported antipattern IDs are incorrect!):
The German rule: AUFRECHT_ERHALTEN_antipattern:2:null, token [3], contains
"kaufen" that is marked as regular expression but probably is not one.
Contrary to what the message reports, the bug is in
the 2nd antipattern of INFINITIV_MIT_ZU, which contains:
<rulegroup id="INFINITIV_MIT_ZU" name="Zusammen-/Getrenntschreibung: Infinitiv mit 'zu'">
....
<antipattern>
<!-- 'gibt es nach wie vor zu kaufen' -->
<token>vor</token>
<token>zu</token>
<token regexp="yes">kaufen</token>
</antipattern>
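For readers wondering what "marked as regular expression but probably is not one" means in practice, here is a rough sketch of such a heuristic. It is an assumption for illustration only, not the actual PatternTestTools.warnIfRegexpSyntaxNotKosher implementation; the class RegexpAttributeCheck, its methods, and the list of metacharacters are all hypothetical.

```java
import java.util.Optional;

// Hypothetical helper, not part of LanguageTool.
class RegexpAttributeCheck {

  // Characters treated as a sign of regular-expression syntax (an assumption).
  private static final String META_CHARS = ".\\[](){}|*+?^$";

  /** Returns true if the token text contains at least one regex metacharacter. */
  static boolean looksLikeRegexp(String token) {
    for (char c : token.toCharArray()) {
      if (META_CHARS.indexOf(c) >= 0) {
        return true;
      }
    }
    return false;
  }

  /** Returns a warning if the regexp marking and the token text disagree. */
  static Optional<String> check(String token, boolean markedAsRegexp) {
    boolean looks = looksLikeRegexp(token);
    if (markedAsRegexp && !looks) {
      return Optional.of("\"" + token
          + "\" is marked as regular expression but probably is not one");
    }
    if (!markedAsRegexp && looks) {
      return Optional.of("\"" + token
          + "\" is not marked as regular expression but probably is one");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // The German token from the antipattern above, marked regexp="yes".
    System.out.println(check("kaufen", true).orElse("ok"));
    // A plain token without the regexp attribute.
    System.out.println(check("vor", false).orElse("ok"));
    // A string with alternation that is not marked as a regexp.
    System.out.println(check("(?:comp|interp):comma|SENT_END", false).orElse("ok"));
  }
}
```

A heuristic like this will produce false positives for literal tokens that contain a dot or similar characters, which is presumably why the real messages say "probably".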
Other errors:
The Polish rule: null_antipattern:7:null, token [1], contains "" that is
marked as inflected but is empty, so the attribute is redundant.
The Polish rule: BRAK_PRZECINKA_SPOJNIK_PROSTY_antipattern:1:null
(exception in POS tag of token [3]), token [3], contains
"(?:comp|interp):comma|SENT_END" that is not marked as regular expression
but probably is one.
The Polish rule: ZOSTALO_ZROBIONE_antipattern:1:null, token [1], contains
"" that is marked as inflected but is empty, so the attribute is redundant.
The Polish rule: ZOSTALO_ZROBIONE_antipattern:4:null, token [1], contains
"być" that is marked as regular expression but probably is not one.
Regards
Dominique