Re: Behavior of non-breaking space U+00A0 in LanguageTool

Andriy Rysin Sun, 11 Oct 2015 11:56:18 -0700

I would agree with that. I had to add U+00A0 in a lot of places, including
the sentence tokenizer, the word tokenizer and some rules for Ukrainian.
But for text analysis it's pretty much the same as a normal space, so it
would make sense to handle this at a common level (early in the
process).
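
To make "early in the process" concrete, here is a minimal sketch of the
idea, not LanguageTool's actual code. The class NbspNormalizer and its
methods are hypothetical names, and tokenize() is only a toy stand-in for
a real word tokenizer. The sketch assumes it is acceptable to replace
U+00A0 with U+0020 before any tokenizer sees the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
class NbspNormalizer {

    // Replace every no-break space (U+00A0) with a plain space (U+0020).
    // This is a one-char-for-one-char swap, so character offsets are unchanged.
    static String normalize(String text) {
        return text.replace('\u00A0', ' ');
    }

    // Toy word tokenizer (an assumption for illustration only):
    // it splits on ASCII space, tab and newline, and knows nothing about U+00A0.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[ \t\n]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Egg\u00A0yoke.";
        // Without normalization the toy tokenizer does not split at U+00A0.
        System.out.println(tokenize(input).size());
        // With normalization first, the two words are separated as usual.
        System.out.println(tokenize(normalize(input)).size());
    }
}
```

Because the replacement is one character for one character, positions
reported against the normalized text still line up with the original
input, so no offset mapping would be needed in this sketch.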


Thanks
Andriy


2015-10-11 13:18 GMT-04:00 Dominique Pellé <dominique.pe...@gmail.com>:
> Hi
>
> Consider this very simple rule in the English grammar.xml:
>
> <rule id="EGG_YOKE" name="egg yoke (egg yolk)">
>   <pattern>
>     <token>egg</token>
>     <token>yoke</token>
>   </pattern>
>
> The rule works fine if the 2 words are separated by
> one or more spaces, tabs or newlines. However, it does not
> work when the 2 words are separated by a non-breaking
> space (U+00A0).  I wonder why.
>
> With a normal space (U+0020):
>
> $ echo "Egg yoke." | java -jar languagetool-commandline.jar -l en -v -
> Expected text language: English (no spell checking active, specify a
> language variant like 'en-GB' if available)
> Working on STDIN...
> 1283 rules activated for language English
> 1283 rules activated for language English
> <S> Egg[egg/NN:UN,egg/VB,egg/VBP,B-NP-singular]
> yoke[yoke/NN,yoke/VB,yoke/VBP,E-NP-singular].[./.,</S>,O]<P/>
> Disambiguator log:
>
> 1.) Line 1, column 1, Rule ID: EGG_YOKE[1]
> Message: Did you mean 'egg yolk'?
> Suggestion: Egg yolk
> Egg yoke.
> ^^^^^^^^
> Time: 2405ms for 0 sentences (0.0 sentences/sec)
>
>
> With a non-breaking space (U+00A0):
>
> $ echo "Egg yoke." | java -jar languagetool-commandline.jar -l en -v -
> Expected text language: English (no spell checking active, specify a
> language variant like 'en-GB' if available)
> Working on STDIN...
> 1283 rules activated for language English
> 1283 rules activated for language English
> <S> Egg[egg/NN:UN,egg/VB,egg/VBP,B-NP-singular] [
> /null]yoke[yoke/NN,yoke/VB,yoke/VBP].[./.,</S>]<P/>
> Disambiguator log:
>
> Time: 2347ms for 0 sentences (0.0 sentences/sec)
>
> With a non-breaking space, the debug output shows
> a token [ /null], i.e. the non-breaking space is
> interpreted as a word, and the rule does not match
> as a result.
>
> I think that spaces and non-breaking spaces should behave
> the same for LanguageTool. The only purpose of a non-breaking
> space is presentation, so that a line is not broken at
> that position. Nothing else should be affected by a
> non-breaking space (same word count, same
> word tokenization, …)
>
> Regards
> Dominique
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>

