Re: [PATCH] Ignoring characters

Andriy Rysin Tue, 10 Mar 2015 19:07:04 -0700

I've pushed a fix for MorfologikSpellerRule along with
MorfologikAmericanSpellerRuleTest change to test the problem (BTW
sorry for the full change in the test class - I pressed a wrong button
in Eclipse and it played a bad trick on me).
Now the speller should ignore soft hyphens inside words, as other rules do.
There's one interesting point though: if you revert the fix, the first
pair of tests in testIgnoredChars() still passes, so the Morfologik speller
was always ignoring ignored characters in the first word. But the
second pair of tests would fail.
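
As a rough illustration of what "ignoring" means for the speller (a minimal
sketch, not the committed fix; the class name, the method name and the
ignored-character set are made up for the example), a word can be cleaned of
ignored characters before it is looked up in the dictionary:

```java
import java.util.Set;

// Illustrative only: this is NOT the LanguageTool implementation.
class IgnoredCharsSketch {

  // Hypothetical ignored-character set: just the soft hyphen (U+00AD).
  static final Set<Character> IGNORED = Set.of('\u00AD');

  // Returns the token with every ignored character removed.
  static String stripIgnored(String token) {
    StringBuilder sb = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      if (!IGNORED.contains(c)) {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // A word with a soft hyphen in the middle.
    String raw = "spell\u00ADing";
    // The cleaned form is what a dictionary lookup would receive.
    System.out.println(stripIgnored(raw).equals("spelling"));
  }
}
```

A per-language set would presumably replace the fixed set here, since the
ignored characters are configurable per language.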


The first word is unaffected because JLanguageTool.getRawAnalyzedSentence()
only updates tokens starting with i=1. I was not sure what the right fix
is, so I left it as it is.
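
The effect of a loop that starts at i=1 can be shown with a toy model. This
is a sketch under assumptions, not the actual JLanguageTool code: the class
and method names are hypothetical, and the lists are assumed to hold only
word texts, with the first word at index 0.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the code in JLanguageTool.getRawAnalyzedSentence().
class RestoreLoopSketch {

  // Puts the original (uncleaned) token texts back over the cleaned ones,
  // beginning at index `start`. Entries before `start` keep the cleaned text.
  static List<String> restore(List<String> cleaned, List<String> original, int start) {
    List<String> result = new ArrayList<>(cleaned);
    for (int i = start; i < original.size(); i++) {
      result.set(i, original.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> original = List.of("fir\u00ADst", "sec\u00ADond");
    List<String> cleaned = List.of("first", "second");
    List<String> tokens = restore(cleaned, original, 1);
    // Index 0 stays cleaned; index 1 gets its soft hyphen back.
    System.out.println(tokens.get(0).equals("first"));
    System.out.println(tokens.get(1).equals("sec\u00ADond"));
  }
}
```

In this model a rule that reads the raw token text sees a clean first word
and a "dirty" second word, which matches the test behaviour described above.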

Please let me know if you see any problems,
Thanks,
Andriy

2015-03-08 13:21 GMT-04:00 Andriy Rysin <ary...@gmail.com>:
> I've found one problem with ignored characters: Morfologik speller
> does not skip ignored characters as it gets the sentence with those
> chars left inside. This is demonstrated by the test patch below.
>
> It seems like JLanguageTool.getRawAnalyzedSentence() removes those
> characters, then does the tagging, then puts original tokens back in
> the sentence, and that sentence is fed to speller (as well as other
> rules).
> The speller uses AnalyzedTokenReadings.getToken(), which returns the word
> with the ignored characters. But other rules may work correctly if they use
> AnalyzedTokenReadings.getAnalyzedToken(0).getToken() (which returns a
> token without those chars). I think we may also want to check
> PatternRule to see which method it uses and whether it needs to be adjusted.
>
> I could put a workaround in Ukrainian but it feels like a common
> problem, so if everybody agrees we can fix it in common code. It looks
> like the easiest solution is to make MorfologikSpellerRule use tokens
> without those chars.
>
> Andriy
>
> diff --git a/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java b/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
> index 3118b4e..cd6f011 100644
> --- a/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
> +++ b/languagetool-language-modules/uk/src/test/java/org/languagetool/rules/uk/MorfologikUkrainianSpellerRuleTest.java
> @@ -45,6 +45,10 @@
>
>      assertEquals(0, rule.match(langTool.getAnalyzedSentence("До нас приїде The Beatles!")).length);
>
> +    // soft hyphen
> +    assertEquals(0, rule.match(langTool.getAnalyzedSentence("колискової пісні")).length);
> +
> +
>      //incorrect sentences:
>
>      RuleMatch[] matches = rule.match(langTool.getAnalyzedSentence("атакуючий"));
>
> 2015-01-21 22:33 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>> Ok, I've pushed a change to allow a per-language set of characters to be
>> ignored in tokens (e.g. Ukrainian adds an accent U+0301 to the soft
>> hyphen). Adding a reading with a null tag seems to have affected correct
>> position markup, so I've adjusted my rules to take that into account.
>>
>> Please try it and let me know how it works for you,
>> Thanks
>> Andriy
>>
>> P.S. One thing I could not figure out (yet) is correct markup for
>> tokens with ignored characters in xml rules, see
>> languagetool-language-modules/uk/src/main/resources/org/languagetool/rules/uk/grammar-spelling.xml:93
>>
>>
>> 2015-01-20 11:55 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>>> Ok, so I have a token agreement rule which checks if any of the token
>>> readings has the required form. If it finds one, the token is good; if it
>>> doesn't, it shows an error; but if it finds a reading with a null tag it
>>> assumes we don't know enough and skips the check for this token. It seems
>>> we use a null tag for untagged words, so this works when the reading with
>>> the null POSTAG is the only one. If we're saying we can have additional
>>> readings with null which are "information-only" I can probably adjust
>>> the logic I have.
>>>
>>> We could also tag the reading with ignored chars inside the same way
>>> the "cleaned" token is but I am afraid the "dirty" token reading will
>>> affect suggestions etc in the way we don't want.
>>>
>>> Andriy
>>>
>>> 2015-01-20 9:58 GMT-05:00 Daniel Naber <daniel.na...@languagetool.org>:
>>>> On 2015-01-20 14:29, Andriy Rysin wrote:
>>>>
>>>>> So in JLanguageToolTest.testAnalyzedSentence() (line 133) the expected
>>>>> reading for the token with a soft hyphen expects tested/null, but I don't
>>>>> really understand this logic.
>>>>
>>>> I think the null is probably not the point; the code in
>>>> JLanguageTool.getRawAnalyzedSentence() seems to re-add the token with
>>>> the soft hyphen again. It probably simply uses null as a POS tag because
>>>> I (or whoever added it) thought it shouldn't hurt. So maybe just the
>>>> token needs to be set, not another reading (adding the null reading may
>>>> be just a side effect).
>>>>
>>>> Regards
>>>>   Daniel
>>>>
>>>>
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> Languagetool-devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
