Daniel Naber wrote:

> On 2015-10-09 07:32, Dominique Pellé wrote:
>
>> I suppose that I care more than most because I only use LT to check
>> text files where the situation is frequent.
>
> I think normalizing the text makes sense if:
> 1) single line breaks get removed from plain text files (but not double
>    spaces)
> 2) this normalization doesn't happen in LT core, but in the command-line
>    client
>
> My understanding is that's not enough for your use case as you use
> spaces for indentation? For me, this sounds like a general input format
> issue, just like people want to use LT to check LaTeX. We cannot support
> that in the core, but if we find a way to do it outside that would be
> okay for me. We just need to avoid becoming a parser for every format
> out there.
>
> We already have the concept of annotated text[1], I think this could be
> used to check plain text files. "\n" is then markup just like "<h1>" is
> markup in XML. So we don't need normalization in that sense, but we need
> to parse the input.
>
> [1]
> https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html
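
A minimal sketch of the annotated-text idea quoted above, to make the mechanics concrete. It is plain Python and does not use LanguageTool's AnnotatedText classes; the names `annotate` and `checker_view` are hypothetical, and interpreting each soft line break as one space is an assumption of this sketch. A single line break (plus the indentation after it) is treated as markup, blank lines stay in the text, and an offset table maps positions in the checked text back to the original file.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not LanguageTool's API. A single line break, together
# with any indentation that follows it, counts as markup. Blank lines
# (two consecutive "\n") are left in the text as paragraph separators.
SOFT_BREAK = re.compile(r"(?<!\n)\n(?!\n)[ \t]*")


def annotate(plain: str) -> List[Tuple[str, str]]:
    """Split plain text into ("text", s) and ("markup", s) parts."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    for m in SOFT_BREAK.finditer(plain):
        if m.start() > pos:
            parts.append(("text", plain[pos:m.start()]))
        parts.append(("markup", m.group()))
        pos = m.end()
    if pos < len(plain):
        parts.append(("text", plain[pos:]))
    return parts


def checker_view(parts: List[Tuple[str, str]]) -> Tuple[str, List[int]]:
    """Return the text a checker would see, plus the original offset of
    every character in that text (assumption: markup reads as one space)."""
    view: List[str] = []
    offsets: List[int] = []
    orig = 0
    for kind, s in parts:
        if kind == "text":
            for i, ch in enumerate(s):
                view.append(ch)
                offsets.append(orig + i)
        else:
            view.append(" ")
            offsets.append(orig)
        orig += len(s)
    return "".join(view), offsets


plain = "This is a\n  wrapped sentence.\n\nNext paragraph."
view, offsets = checker_view(annotate(plain))
start = view.index("wrapped")
print(repr(view))
print(start, "->", offsets[start])
```

With such a mapping, a rule can match on the view (where the wrapped sentence is on one line) and still report the error at the right position in the original file.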

I'm not sure I understand how it would work for users. Would users
have to set an option: a command-line flag, or a check box in the GUI?
That seems unfortunate, since it worked well before without any option,
and users may not be aware that such an option exists.

I wonder how many users copy and paste text into the web interface of LT.
Those users will also have a degraded experience.

I seem to be the only one really bothered by the regression. I don't
mean to be too negative about it. I like the new <regexp> feature, but I
don't like the regression, because plain text is ubiquitous and many
text files use multiple spaces as well as line breaks within sentences.

I could instead use \s+ in the regexps for fr, eo and br, which I maintain.
But it's not nice if only those 3 languages work. And yes, it would
clutter the regexps, but I'd still find that acceptable.

Mike Unwalla wrote:

> I understand why you want to preprocess text. Sometimes, I have a similar
> problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
> characters.
>
> However, automatically ignoring such text could cause problems. For example,
> not all double spaces are errors. For the Netherlands, "there should be a
> double space between the postcode and the post town"
> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

That's true. It's a rare case, but it's good to be able to detect such
errors. Ironically, the example given in your link does not respect the
rule it preaches for the Dutch address, since I see only one space between
the postcode and the post town in "2312 BK LEIDEN". The address in
Luxembourg is also misspelled (Longway -> Longwy), but that's off-topic.

Your link gives me the idea of writing semantic rules to check
address formatting in various countries.

Examples of rules for checking addresses in France:
- house number should be before street name
- postal code should be before city name
- postal code should be 5 digits without space (29200 is ok,
  29 200 is wrong)
- etc.

Good example:

  23 Rue de l’église
  29200 BREST
  FRANCE

Bad example (postal code after city name):

  23 Rue de l’église
  BREST 29200
  FRANCE

The <regexp> feature will be great for such rules. Something
like this may work (not tested):

<regexp>\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b</regexp>

> I did not mean that you should not preprocess text. I meant that you should
> not mess with the meaning of a regexp.
>
> Possibly, we can solve the conflict by having 2 types of <regexp>:
> <regexp type="exact-meaning">
> <regexp type="smart">

That would be ideal in my opinion. Use of "exact-meaning" would
be very rare. Maybe a better name: <regexp collapse_spaces="no">

Regards
Dominique

------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
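
For anyone who wants to try the French address regexp above outside LT, here is a rough sketch in Python. It only shows which of the example addresses the pattern matches; it says nothing about how LT would apply a <regexp> rule. Two assumptions: Python's `re` module has no `\p{Lu}`, so an approximate uppercase class `[A-ZÀ-ÖØ-Þ]` stands in for it, and the spelling "Boulevard" is used. The name `matches_address_pattern` is hypothetical.

```python
import re

# Approximation of the regexp from the message. Python's "re" does not
# support \p{Lu}, so [A-ZÀ-ÖØ-Þ] is used as a stand-in (an assumption).
ADDRESS = re.compile(
    r"\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n"
    r"\s+\d{5}\s+[A-ZÀ-ÖØ-Þ].*\n"
    r"\s+FRANCE\b"
)


def matches_address_pattern(text: str) -> bool:
    """True if a street line is followed by '<5 digits> <CITY>' and 'FRANCE'."""
    return ADDRESS.search(text) is not None


good = "  23 Rue de l’église\n  29200 BREST\n  FRANCE"
bad_order = "  23 Rue de l’église\n  BREST 29200\n  FRANCE"
bad_space = "  23 Rue de l’église\n  29 200 BREST\n  FRANCE"

for label, sample in [("good", good), ("bad_order", bad_order), ("bad_space", bad_space)]:
    print(label, matches_address_pattern(sample))
```

Because of the `\n\s+` parts, the pattern depends on the line breaks and on at least one whitespace character of indentation before the postal code and the country, so collapsing whitespace before matching would change what it matches.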