Re: new syntax available

Dominique Pellé Fri, 09 Oct 2015 21:18:03 -0700

Daniel Naber wrote:

> On 2015-10-09 07:32, Dominique Pellé wrote:
>
>> I suppose that I care more than most because I only use LT to check
>> text files where the situation is frequent.
>
> I think normalizing the text makes sense if:
> 1) single line breaks get removed from plain text files (but not double
> spaces)
> 2) this normalization doesn't happen in LT core, but in the command-line
> client
>
> My understanding is that's not enough for your use case as you use
> spaces for indentation? For me, this sounds like a general input format
> issue, just like people want to use LT to check LaTeX. We cannot support
> that in the core, but if we find a way to do it outside that would be
> okay for me. We just need to avoid becoming a parser for every format
> out there.
>
> We already have the concept of annotated text[1], I think this could be
> used to check plain text files. "\n" is then markup just like "<h1>" is
> markup in XML. So we don't need normalization in that sense, but we need
> to parse the input.
>
> [1]
> https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html

I'm not sure I understand how it would work for users.
Would users have to give an option? Command line, or check box
for the GUI? That seems unfortunate, since it worked well before
without specifying an option, which users may not be aware of.

I wonder how many users copy and paste text into the web interface
of LT. Those users will also have a degraded experience.

I seem to be the only one really bothered with the regression.
I don't mean to be too negative about it. I like the new <regexp>
feature, but I don't like the regression because the plain text format
is ubiquitous and many text files use double spaces as
well as line breaks within sentences.

I could instead use \s+ in the regexps for fr, eo and br, which I maintain.
But it's not nice if only those 3 languages work.
And yes, it would clutter regexps, but I'd still find it acceptable.
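
To illustrate the \s+ idea, here is a minimal sketch in Python rather than LT rule XML. The helper relax_spaces and the sample phrase are hypothetical, made up for this example; they are not part of LT.

```python
import re

def relax_spaces(pattern: str) -> str:
    # Hypothetical helper: turn each run of literal spaces in a pattern
    # into \s+ so the pattern also matches across double spaces, tabs
    # and line breaks. It does not treat spaces inside character
    # classes specially.
    return re.sub(r" +", lambda m: r"\s+", pattern)

# A sentence wrapped and indented the way plain text files often are.
wrapped = "Il y a beaucoup\n  de  monde."

strict = "beaucoup de monde"      # single literal spaces only
tolerant = relax_spaces(strict)   # beaucoup\s+de\s+monde

print(re.search(strict, wrapped) is not None)
print(re.search(tolerant, wrapped) is not None)
```

The strict pattern misses the wrapped sentence while the relaxed one finds it, at the cost of a less readable regexp.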

Mike Unwalla wrote:

> I understand why you want to preprocess text. Sometimes, I have a similar
> problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
> characters.
>
> However, automatically ignoring such text could cause problems. For example,
> not all double spaces are errors. For the Netherlands, "there should be a
> double space between the postcode and the post town"
> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

That's true.  It's a rare case, but it's good to be able to detect
such errors.

Ironically, the example given in your link does not respect
the rule it preaches for the Dutch address since I see only one space
between the postal code and the post town in "2312 BK LEIDEN".
The address in Luxembourg is also misspelled (Longway -> Longwy)
but that's off-topic.

Your link gives me the idea of writing semantic rules to check
address formatting in various countries. Examples of rules for
checking addresses in France:
- house number should be before street name
- postal code should be before city name
- postal code should be 5 digits without space (29200 is ok, 29 200 is wrong)
- etc.

Good example:
    23 Rue de l’église
    29200 BREST
    FRANCE

Bad example (postal code after city name):
   23 Rue de l’église
   BREST 29200
   FRANCE

The <regexp> feature will be great for such rules.
Something like this may work (not tested):

<regexp>\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b</regexp>
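
A quick way to try the pattern against the two examples above is a small Python sketch. Two assumptions: Python's re module does not support \p{Lu}, so an approximate uppercase class stands in for it here, and the pattern is run outside LT, so this says nothing about how <regexp> itself treats line breaks.

```python
import re

# Same pattern as above, with \p{Lu} swapped for an approximate
# uppercase class because Python's re module lacks \p{...} support.
ADDRESS = re.compile(
    r"\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n"
    r"\s+\d{5}\s+[A-ZÀ-ÖØ-Þ].*\n"
    r"\s+FRANCE\b"
)

good = "    23 Rue de l’église\n    29200 BREST\n    FRANCE"
bad = "   23 Rue de l’église\n   BREST 29200\n   FRANCE"

print(ADDRESS.search(good) is not None)  # postal code before city name
print(ADDRESS.search(bad) is not None)   # postal code after city name
```

As written, the pattern matches the well-formed layout and not the bad one, so a rule that flags errors would need a pattern for the wrong order instead. It also relies on \n and on leading indentation, so it only works if line breaks and spaces are not collapsed.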

> I did not mean that you should not preprocess text. I meant that you should
> not mess with the meaning of a regexp.
>
> Possibly, we can solve the conflict by having 2 types of <regexp>:
> <regexp type="exact-meaning">
> <regexp type="smart">

That would be ideal in my opinion.
Use of "exact-meaning" would be very rare.
Maybe a better name: <regexp collapse_spaces="no">

Regards
Dominique
