I would agree with that. I had to add U+00A0 in a lot of places, including the sentence tokenizer, the word tokenizer and some rules for Ukrainian. But for text analysis it's pretty much the same as a normal space, so it would make sense to handle this at a common level (early in the process).
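To illustrate what handling it "at a common level" could look like, here is a minimal sketch that maps U+00A0 to a normal space before the text reaches any tokenizer. The class and method names (NbspNormalizer, normalize) are made up for this example and are not existing LanguageTool code; where such a step would actually be hooked in is left open.

```java
// Hypothetical helper, NOT part of LanguageTool: replaces every
// no-break space (U+00A0) with an ordinary space (U+0020) so that
// later stages only ever see one kind of space.
final class NbspNormalizer {

    private static final char NO_BREAK_SPACE = '\u00A0';

    private NbspNormalizer() {
        // static utility, no instances
    }

    // One char is replaced by exactly one char, so the length of the
    // text and every character offset stay the same.
    static String normalize(String text) {
        return text.replace(NO_BREAK_SPACE, ' ');
    }
}
```

Since the replacement is one-to-one, positions reported against the normalized text would still line up with the original input. A real fix would probably also have to decide what to do with related characters such as the narrow no-break space (U+202F), which this sketch ignores.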
Thanks
Andriy

2015-10-11 13:18 GMT-04:00 Dominique Pellé <dominique.pe...@gmail.com>:

> Hi
>
> Consider this very simple rule in the English grammar.xml:
>
> <rule id="EGG_YOKE" name="egg yoke (egg yolk)">
>   <pattern>
>     <token>egg</token>
>     <token>yoke</token>
>   </pattern>
>
> The rule works fine if the 2 words are separated by
> spaces, tabs or newlines. However, it does not
> work when the 2 words are separated by a non-breaking
> space (U+00A0). I wonder why.
>
> With a normal space (U+0020):
>
> $ echo "Egg yoke." | java -jar languagetool-commandline.jar -l en -v -
> Expected text language: English (no spell checking active, specify a language variant like 'en-GB' if available)
> Working on STDIN...
> 1283 rules activated for language English
> 1283 rules activated for language English
> <S> Egg[egg/NN:UN,egg/VB,egg/VBP,B-NP-singular] yoke[yoke/NN,yoke/VB,yoke/VBP,E-NP-singular].[./.,</S>,O]<P/>
> Disambiguator log:
>
> 1.) Line 1, column 1, Rule ID: EGG_YOKE[1]
> Message: Did you mean 'egg yolk'?
> Suggestion: Egg yolk
> Egg yoke.
> ^^^^^^^^
> Time: 2405ms for 0 sentences (0.0 sentences/sec)
>
>
> With a non-breaking space:
>
> $ echo "Egg yoke." | java -jar languagetool-commandline.jar -l en -v -
> Expected text language: English (no spell checking active, specify a language variant like 'en-GB' if available)
> Working on STDIN...
> 1283 rules activated for language English
> 1283 rules activated for language English
> <S> Egg[egg/NN:UN,egg/VB,egg/VBP,B-NP-singular] [ /null]yoke[yoke/NN,yoke/VB,yoke/VBP].[./.,</S>]<P/>
> Disambiguator log:
>
> Time: 2347ms for 0 sentences (0.0 sentences/sec)
>
> With a non-breaking space, the debug output shows
> a token [ /null], i.e. the non-breaking space is
> interpreted as a word, and the rule does not match
> as a result.
>
> I think that spaces and non-breaking spaces should behave
> the same for LanguageTool. The only purpose of a non-breaking
> space is presentation, so that a line is not broken at
> a non-breaking space, but everything else should not be
> affected by a non-breaking space (same word count, same
> word tokenization, …)
>
> Regards
> Dominique
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel