Actually, what I wanted to try is adding a check to the grammar rule that also
takes the person of the verb into account, so that we compare the tenses of
two verbs only if they are in the same person.
Therefore, instead of adding the exception as Dominique suggested (which,
btw, is something we could also try), the rule would not match
raccolse[raccogliere/VER:ind+past+3+s]
with
viaggio[viaggio/NOUN-M:s,viaggiare/VER:ind+pres+1+s]
because they are not in the same person. I assume that if someone were to make
a verb tense mistake when writing a sentence, they would at least use the same
person.
Any suggestions on how this could be achieved? I guess that it would be much
easier in Java than with regexps.
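As a rough illustration of what I mean, the person check could look something
like this in plain Java. It is only a sketch that works on the POS tag strings
printed by the tagger; the class and method names are invented for the example
and are not existing LanguageTool code, and how to hook it into a rule is
still an open question.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not part of LanguageTool: compares the person of verb readings. */
final class VerbPersonCheck {

  /** Returns the person ("1", "2" or "3") of a tag like "VER:ind+past+3+s", or empty if none. */
  static Optional<String> person(String posTag) {
    if (!posTag.startsWith("VER:")) {
      return Optional.empty();
    }
    for (String feature : posTag.substring("VER:".length()).split("\\+")) {
      if (feature.matches("[123]")) {
        return Optional.of(feature);
      }
    }
    return Optional.empty(); // e.g. infinitives and participles carry no person
  }

  /** True if some verb reading of the first token has the same person as one of the second. */
  static boolean shareVerbPerson(List<String> first, List<String> second) {
    for (String a : first) {
      Optional<String> personA = person(a);
      if (!personA.isPresent()) {
        continue; // not a finite verb reading
      }
      for (String b : second) {
        if (personA.equals(person(b))) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> raccolse = List.of("VER:ind+past+3+s");
    List<String> viaggio = List.of("NOUN-M:s", "VER:ind+pres+1+s");
    List<String> attraverso = List.of("VER:ind+past+3+s");
    System.out.println(shareVerbPerson(raccolse, viaggio));    // false
    System.out.println(shareVerbPerson(raccolse, attraverso)); // true
  }
}
```

With the readings from the test sentence, "raccolse" and "viaggio" do not share
a person, so the tense check would be skipped for that pair, while "raccolse"
and "attraversò" would still be compared.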
Thanks
Paolo
________________________________
From: Dominique Pellé <dominique.pe...@gmail.com>
To: development discussion for LanguageTool
<languagetool-devel@lists.sourceforge.net>
Sent: Thursday, December 27, 2012 11:24 PM
Subject: Re: Italian Language enhancements
Mauro Condarelli <mc5...@mclink.it> wrote:
>Hi,
>I'm trying to use LT for Italian.
>There are a lot of false positives in my language, so I started to
>look around to enhance the rules.
>
>I found out that many false positives come from incorrect tagging (to be
>more precise: lack of disambiguation), so I tried to implement some
>very simple disambiguation.
>Unfortunately it doesn't seem to work. At the end of this message you will
>find my changes.
>
>My first test sentence is:
>"Prima di lasciarsi il tempo di pensare troppo raccolse zaino e
>bastone da viaggio e, con un lungo passo determinato, attraversò la
>soglia."
>
>Test results are:
>" Starting check in Italian...
>
>1. Line 1, column 47
>Message: Controllare il tempo dei verbi utilizzati nella frase. (deactivate)
>Context: ...di lasciarsi il tempo di pensare troppo raccolse zaino e bastone
>da viaggio e, con un lungo passo determinato, attr...
>
>Potential problems found: 1 (time: 25ms)"
>
>Which is absolutely wrong because the highlighted part contains just
>one verb ("raccolse").
>
>Tagging gives:
>" <S> Prima[primo/ADJ:pos+f+s, prima/ADV] di[di/PRE]
>lasciarsi[lasciare/VER:inf+pres+si] il[il/ART-M:s] tempo[tempo/NOUN-M:s]
>di[di/PRE] pensare[pensare/VER:inf+pres]
>troppo[troppo/ADV, troppo/ADJ:pos+m+s, troppo/DET-INDEF:m+s]
>raccolse[raccogliere/VER:ind+past+3+s] zaino[zaino/NOUN-M:s] e[e/CON]
>bastone[bastone/NOUN-M:s] da[da/PRE]
>viaggio[viaggio/NOUN-M:s, viaggiare/VER:ind+pres+1+s] e[e/CON],[,/PON]
>con[con/PRE] un[un/ART-M:s] lungo[lungo/ADJ:pos+m+s, lungo/PRE]
>passo[passo/NOUN-M:s, passo/ADJ:pos+m+s, passare/VER:ind+pres+1+s]
>determinato[determinato/ADJ:pos+m+s, determinare/VER:part+past+s+m],[,/PON]
>attraversò[attraversare/VER:ind+past+3+s] la[la/PRO-PERS-CLI-3-F-S, la/ART-F:s]
>soglia[soglia/NOUN-F:s, solere/VER:cond+pres+2+s, solere/VER:cond+pres+1+s,
>solere/VER:cond+pres+3+s].[./SENT, </S>] "
>
>There's an ambiguity in the word "viaggio" which, taken alone, can
>be either a noun ("trip", the correct meaning in this case) or a
>verb ("I travel"), as correctly stated by tagging.
>I assume this is the reason for the false positive; can someone
>confirm, please?
>
>I thus tried to avoid this particular error by adding the
>disambiguating rules below.
>What I wanted to say is: "a PREposition or ARTicle cannot immediately
>precede a VERb".
>
>Obviously I goofed somewhere because it didn't work (the above
>results are *with* the changes).
>
>Can someone help me, please?
>TiA
>Mauro
>
>
>Index: src/main/java/org/languagetool/language/Italian.java
>===================================================================
>--- src/main/java/org/languagetool/language/Italian.java (revision 8680)
>+++ src/main/java/org/languagetool/language/Italian.java (working copy)
>@@ -32,11 +32,14 @@
> import org.languagetool.rules.WordRepeatRule;
> import org.languagetool.rules.it.MorfologikItalianSpellerRule;
> import org.languagetool.tagging.Tagger;
>+import org.languagetool.tagging.disambiguation.Disambiguator;
>+import org.languagetool.tagging.disambiguation.rules.it.ItalianRuleDisambiguator;
> import org.languagetool.tagging.it.ItalianTagger;
>
> public class Italian extends Language {
>
>   private Tagger tagger;
>+  private Disambiguator disambiguator;
>
>   @Override
>   public Locale getLocale() {
>@@ -77,6 +80,14 @@
>   }
>
>   @Override
>+  public final Disambiguator getDisambiguator() {
>+    if (disambiguator == null) {
>+      disambiguator = new ItalianRuleDisambiguator();
>+    }
>+    return disambiguator;
>+  }
>+
>+  @Override
>   public Contributor[] getMaintainers() {
>     final Contributor contributor = new Contributor("Paolo Bianchini");
>     return new Contributor[] { contributor };
>Index: src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java
>===================================================================
>--- src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java (revision 0)
>+++ src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java (revision 0)
>@@ -0,0 +1,32 @@
>+/* LanguageTool, a natural language style checker
>+ * Copyright (C) 2007 Daniel Naber (http://www.danielnaber.de)
>+ *
>+ * This library is free software; you can redistribute it and/or
>+ * modify it under the terms of the GNU Lesser General Public
>+ * License as published by the Free Software Foundation; either
>+ * version 2.1 of the License, or (at your option) any later version.
>+ *
>+ * This library is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>+ * Lesser General Public License for more details.
>+ *
>+ * You should have received a copy of the GNU Lesser General Public
>+ * License along with this library; if not, write to the Free Software
>+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
>+ * USA
>+ */
>+
>+package org.languagetool.tagging.disambiguation.rules.it;
>+
>+import org.languagetool.Language;
>+import org.languagetool.tagging.disambiguation.rules.AbstractRuleDisambiguator;
>+
>+public class ItalianRuleDisambiguator extends AbstractRuleDisambiguator {
>+
>+  @Override
>+  protected Language getLanguage() {
>+    return Language.ITALIAN;
>+  }
>+
>+}
>Index: src/main/resources/org/languagetool/resource/it/disambiguation.xml
>===================================================================
>--- src/main/resources/org/languagetool/resource/it/disambiguation.xml (revision 0)
>+++ src/main/resources/org/languagetool/resource/it/disambiguation.xml (revision 0)
>@@ -0,0 +1,35 @@
>+<?xml version="1.0" encoding="utf-8"?>
>+<!-- Italian Disambiguation Rules for LanguageTool Copyright (C) 2012 Mauro
>+  Condarelli. See disambiguation.xsd for syntax. $Id: $ -->
>+<rules lang="it" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>+  xsi:noNamespaceSchemaLocation="../disambiguation.xsd">
>+  <rulegroup id="art-ver" name="ART+VER→delete">
>+    <rule>
>+      <pattern>
>+        <token postag="ART"></token>
>+        <marker>
>+          <token postag="VER"></token>
>+        </marker>
>+      </pattern>
>+      <disambig action="remove" postag="VER"></disambig>
>+    </rule>
>+    <rule>
>+      <pattern>
>+        <token postag="ARTPRE"></token>
>+        <marker>
>+          <token postag="VER"></token>
>+        </marker>
>+      </pattern>
>+      <disambig action="remove" postag="VER"></disambig>
>+    </rule>
>+    <rule>
>+      <pattern>
>+        <token postag="PRE"></token>
>+        <marker>
>+          <token postag="VER"></token>
>+        </marker>
>+      </pattern>
>+      <disambig action="remove" postag="VER"></disambig>
>+    </rule>
>+  </rulegroup>
>+</rules>
>
>
>
Ciao Mauro
If you're developing a disambiguator, it's very useful to know that
the -v command line option will give you useful information. It
indicates, among other things, which disambiguator rule(s) match
and which POS tag(s) get assigned to words as a result of
disambiguation. This was added by Marcin a few months ago
and I use it all the time :-)
In my experience, it is also very important to test your
disambiguation rules on several texts and think carefully about
them, because lousy disambiguation can cause more problems
than it solves. The disambiguation rules you add may fix your
particular example, but may break other sentences if they match
in unforeseen ways.
In your case, I suppose that you expected this disambiguation
rule to match...
+    <rule>
+      <pattern>
+        <token postag="PRE"></token>
+        <marker>
+          <token postag="VER"></token>
+        </marker>
+      </pattern>
+      <disambig action="remove" postag="VER"></disambig>
+    </rule>
... but it did not match because the POS tag is "VER:ind+pres+1+s"
(not just "VER").
So you would need to use:
<token postag="VER.*" postag_regexp="yes"></token>
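To see the difference, here is a small stand-alone Java sketch. It assumes,
based on the behaviour described above, that a plain postag has to equal the
whole tag and that a postag with postag_regexp="yes" has to match the whole
tag as a regular expression. The class and method names are hypothetical and
this is not LanguageTool's actual matching code.

```java
import java.util.regex.Pattern;

/** Hypothetical demo of exact vs. regexp POS tag matching; not LanguageTool code. */
final class PosTagMatchDemo {

  /**
   * Assumed semantics: without regexp the pattern must equal the tag,
   * with regexp the pattern must match the entire tag.
   */
  static boolean matches(String pattern, boolean regexp, String posTag) {
    return regexp ? Pattern.matches(pattern, posTag) : pattern.equals(posTag);
  }

  public static void main(String[] args) {
    String tag = "VER:ind+pres+1+s";
    System.out.println(matches("VER", false, tag));    // false: "VER" is not the whole tag
    System.out.println(matches("VER.*", true, tag));   // true: the regexp covers the whole tag
    System.out.println(matches("PRE", false, "PRE"));  // true: exact match
  }
}
```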
But the disambiguation rule seems too general to me anyway. I have
not tried it, but I suspect that it is not strict enough: it will
match something like "da prendere" as in "La strada da prendere…"
even though "prendere" really is a verb in that example.
In your example, I think that the grammar rule GR_10_001[4]
is also not strict enough. It uses several skip="-1" without
<exception> which is dangerous and which matches things in
unexpected ways.
$ echo "Prima di lasciarsi il tempo di pensare troppo raccolse zaino e bastone da viaggio e, con un lungo passo determinato, attraversò la soglia." \
    | java -jar ~/sb/languagetool/dist/LanguageTool.jar -l it -v
Expected text language: Italian
Working on STDIN...
121 rules activated for language Italian
<S> Prima[primo/ADJ:pos+f+s,prima/ADV] di[di/PRE]
lasciarsi[lasciare/VER:inf+pres+si] il[il/ART-M:s] tempo[tempo/NOUN-M:s]
di[di/PRE] pensare[pensare/VER:inf+pres]
troppo[troppo/ADV,troppo/ADJ:pos+m+s,troppo/DET-INDEF:m+s]
raccolse[raccogliere/VER:ind+past+3+s] zaino[zaino/NOUN-M:s] e[e/CON]
bastone[bastone/NOUN-M:s] da[da/PRE]
viaggio[viaggio/NOUN-M:s,viaggiare/VER:ind+pres+1+s] e[e/CON],[,/PON]
con[con/PRE] un[un/ART-M:s] lungo[lungo/ADJ:pos+m+s,lungo/PRE]
passo[passo/NOUN-M:s,passo/ADJ:pos+m+s,passare/VER:ind+pres+1+s]
determinato[determinato/ADJ:pos+m+s,determinare/VER:part+past+s+m],[,/PON]
attraversò[attraversare/VER:ind+past+3+s] la[la/PRO-PERS-CLI-3-F-S,la/ART-F:s]
soglia[soglia/NOUN-F:s,solere/VER:cond+pres+2+s,solere/VER:cond+pres+1+s,solere/VER:cond+pres+3+s].[./SENT,</S>]<P/>
Disambiguator log:
1.) Line 1, column 47, Rule ID: GR_10_001[4]
Message: Controllare il tempo dei verbi utilizzati nella frase.
...rima di lasciarsi il tempo di pensare troppo raccolse zaino e bastone da
viaggio e, con un lungo passo determinato, attravers...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Time: 153ms for 1 sentences (6.5 sentences/sec)
You could make the rule a bit less dangerous this way in case
disambiguation was not good enough:
$ svn diff grammar.xml
Index: grammar.xml
===================================================================
--- grammar.xml (revision 8681)
+++ grammar.xml (working copy)
@@ -737,7 +737,10 @@
 <!--
 <token postag="(VER.ind.imp.*.*)|(VER.ind.fut.*.*)|(VER.ind.pres.*.*)" postag_regexp="yes"><exception scope="previous" postag="(ART-F.*)|(ART-M.*)" postag_regexp="yes"></exception></token>
 -->
- <token postag="(VER.ind.fut.*.*)|(VER.ind.pres.*.*)" postag_regexp="yes"><exception scope="previous" postag="(ART-F.*)|(ART-M.*)" postag_regexp="yes"></exception></token>
+ <token postag="(VER.ind.fut.*.*)|(VER.ind.pres.*.*)" postag_regexp="yes">
+   <exception postag="NOUN.*" postag_regexp="yes"/>
+   <exception scope="previous" postag="(ART-F.*)|(ART-M.*)" postag_regexp="yes"></exception>
+ </token>
 <!-- PB006 - -->
 </pattern>
 <message>Controllare il tempo dei verbi utilizzati nella frase.</message>
That removes the error with GR_10_001[4], but there is then still
another false positive with rule GR_10_001[1].
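The effect of the added NOUN exception can be simulated outside LanguageTool
with a few lines of Java. This is only a sketch under the assumption that a
token matches when at least one of its readings matches the postag regexp and
none of its readings matches the exception. The class and method names are
hypothetical, and the exception with scope="previous" is left out.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical simulation of a token pattern with a NOUN exception; not LanguageTool code. */
final class NounExceptionDemo {

  // The two regexps from the patched <token> element in grammar.xml.
  static final Pattern VERB = Pattern.compile("(VER.ind.fut.*.*)|(VER.ind.pres.*.*)");
  static final Pattern NOUN = Pattern.compile("NOUN.*");

  /** True if at least one reading matches the pattern as a whole. */
  static boolean anyReadingMatches(List<String> readings, Pattern pattern) {
    return readings.stream().anyMatch(r -> pattern.matcher(r).matches());
  }

  /** Assumed semantics: a verb reading is required and a noun reading blocks the match. */
  static boolean tokenMatches(List<String> readings) {
    return anyReadingMatches(readings, VERB) && !anyReadingMatches(readings, NOUN);
  }

  public static void main(String[] args) {
    // Readings of "viaggio" taken from the tagger output above.
    System.out.println(tokenMatches(List.of("NOUN-M:s", "VER:ind+pres+1+s"))); // false
    // A made-up token that is unambiguously a present-tense verb.
    System.out.println(tokenMatches(List.of("VER:ind+pres+1+s")));             // true
    // "raccolse" is past tense, so the postag regexp itself does not match.
    System.out.println(tokenMatches(List.of("VER:ind+past+3+s")));             // false
  }
}
```

The trade-off is that any word with a noun reading is skipped, so a real tense
error on such a word would no longer be reported by this token.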
Paolo Bianchini wrote:
> The question is: is it better to have false positives or to miss some errors?
Personally, I prefer having few false positives, even if that means missing
some real errors.
Of course, ideally you want to reduce false positives without missing errors,
but if there is a choice, I'd say that false positives are more annoying than
missed errors.
The number of false positives should be much smaller than the
number of real errors on a typical text. Of course on a perfect text,
you can only have false positives :-) In Italian, I see more false
positives than real errors at the moment. Furthermore, false positives
in Italian often highlight large portions of sentences, which may
hide other real errors. I prefer it when only one or a few words are
highlighted.
If I check a typical newspaper article, I would ideally expect no or
very few false positives with LanguageTool.
Regards
-- Dominique
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel