Daniel Naber <daniel.na...@languagetool.org> wrote:

> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
> <pattern><token>foo</token></pattern>
>
> you can now use
>
> <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack, the new rules need to be tested and adapted manually. It also only
> touches rules without '<marker>' elements. There's no <marker> for
> regexp, it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp etc.
>
> Please let me know how this works for you.
>
> Regards
>   Daniel


Hi Daniel

First of all, thanks for implementing it. But I have some questions and remarks :-)

To me, the idea of <regexp> is useful when we can merge many pattern
rules into a single one, which helps to reduce the number of rules and
improve the maintainability of the rules. I see this example in German:

        <rulegroup id="GIRLS_DAY" name="Eigenname: 'Girl's (Girls’) Day'">
            <short>&eigenname;</short>
            <rule>
                <regexp>(girl|boy)['’`´‘]s day</regexp>
                <message>Meinten Sie den Aktionstag <suggestion>\1s’ Day</suggestion>?</message>
                <example correction="Girls’ Day">Der <marker>Girl's Day</marker> findet einmal im Jahr statt.</example>
                <example correction="Boys’ Day">Der <marker>Boy`s Day</marker> ist ein Aktionstag.</example>
            </rule>
            <rule>
                <regexp>(girls|boys)[´`‘] day</regexp>
                <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
                <short>&eigenname;</short>
                <example>Der <marker>Boys’ Day</marker> findet einmal im Jahr statt.</example>
                <example correction="Boys’ Day">Der <marker>Boys` Day</marker> ist ein Aktionstag.</example>
            </rule>
            <rule>
                <regexp>(girls|boys) day</regexp>
                <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
                <example correction="Girls’ Day">Der <marker>Girls Day</marker> findet einmal im Jahr statt.</example>
                <example correction="Boys’ Day">Der <marker>Boys Day</marker> ist ein Aktionstag.</example>
            </rule>
            <rule>
                <regexp>(girls|boys)['’`´‘]day</regexp>
                <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
                <example correction="Girls’ Day">Der <marker>Girls'Day</marker> findet einmal im Jahr statt.</example>
                <example correction="Boys’ Day">Der <marker>Boys`Day</marker> ist ein Aktionstag.</example>
            </rule>
        </rulegroup>

The idea of <regexp> is that it should now be possible to have a
single rule instead of many rules, using something more or less
like this:

<rule>
  <regexp>(girl|boy)s?[´`‘]?s? day</regexp>
  ...
</rule>
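
One way to see how close such a single pattern gets is to run it over the
sentences from the <example> elements of the four rules. The following is
only a rough sketch: it uses Python's re module as a stand-in for whatever
LanguageTool does internally, the helper name check() is made up for
illustration, and case-insensitive matching is assumed because that was
described as the <regexp> default.

```python
import re

# Candidate single-rule pattern from the sketch above. IGNORECASE mirrors the
# case-insensitive default described for <regexp> (an assumption here).
MERGED = re.compile(r"(girl|boy)s?[´`‘]?s? day", re.IGNORECASE)

# Sentences from the <example> elements of the four GIRLS_DAY rules.
# True means the example carries a correction, so the rule should fire.
EXAMPLES = [
    ("Der Girl's Day findet einmal im Jahr statt.", True),
    ("Der Boy`s Day ist ein Aktionstag.", True),
    ("Der Boys’ Day findet einmal im Jahr statt.", False),
    ("Der Boys` Day ist ein Aktionstag.", True),
    ("Der Girls Day findet einmal im Jahr statt.", True),
    ("Der Boys Day ist ein Aktionstag.", True),
    ("Der Girls'Day findet einmal im Jahr statt.", True),
    ("Der Boys`Day ist ein Aktionstag.", True),
]


def check(pattern, examples):
    """Return the sentences where the pattern disagrees with the expectation."""
    return [
        sentence
        for sentence, should_fire in examples
        if bool(pattern.search(sentence)) != should_fire
    ]


if __name__ == "__main__":
    for sentence in check(MERGED, EXAMPLES):
        print("mismatch:", sentence)
```

With the pattern exactly as written, the Girl's Day, Girls'Day and Boys`Day
sentences are reported, so the character class and the mandatory space would
still need adjusting, without also catching the correct Boys’ Day.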


I have not used <regexp> yet, and I have some questions before I start using it.

1) How do I highlight only a subset of the match?   Trying the above
rule, I see this:

  Line 1, column 8, Rule ID: GIRLS_DAY[1]
  Message: Meinten Sie den Aktionstag 'girls’ Day'?
  Suggestion: girls’ Day
  It's a girl's day.
         ^^^^^^^^^^

But what if I wanted to highlight only the word "girl"? Highlighting
the full pattern may be OK in the above example, but where I'd like to
use <regexp>, I do not want to highlight the full pattern but only part
of it, possibly a single word. For example, in these French
expressions...

        a nouveau -> à nouveau
        a plein temps -> à plein temps
        a rude épreuve -> à rude épreuve
        a vol d'oiseau -> à vol d'oiseau
        ... (etc, more cases in reality...)

I'm thinking of creating a rule like this:

  <regexp>a (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... but how do I say that only the word "a" should be highlighted/underlined?
I don't see the equivalent of <marker>...</marker>.

How about something like this?

  <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... where the marker="1" attribute indicates that captured group #1 of
the regexp should be underlined, i.e. the word "a" in the above example?
What to underline could even be just a portion of a word.
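
To make the proposal concrete, here is a minimal sketch of how a group
number could be turned into an underline range. It is written with Python's
re module purely as an illustration; the function underline_span() and its
marker_group parameter are hypothetical and only model the suggested
marker attribute, they are not an existing LanguageTool API.

```python
import re

PATTERN = re.compile(
    r"(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)",
    re.IGNORECASE,
)


def underline_span(text, marker_group=0):
    """Return (start, end) of the part to underline, or None without a match.

    marker_group=0 is the complete match (the current behaviour as described);
    a positive value selects that captured group instead (the proposal).
    """
    match = PATTERN.search(text)
    return match.span(marker_group) if match else None


if __name__ == "__main__":
    text = "Il travaille a plein temps."
    for group in (0, 1):
        start, end = underline_span(text, group)
        print(text)
        print(" " * start + "^" * (end - start))
```

Group 0 underlines "a plein temps" and group 1 only the "a". Most regexp
APIs expose such per-group offsets (java.util.regex.Matcher has start(int)
and end(int)), so the information should be available once a match is found.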


2) Is there always an implicit word boundary at the beginning or end
of <regexp>?

In the German grammar, I see this for example:
     <regexp case_sensitive="yes">Elisabeth (Selber|Selberth)\b</regexp>

Since I see \b at the end, it suggests that there is no implicit \b.
But I don't see it at the beginning, so I wonder whether \b is needed or not.
An attribute on <regexp> to disable an implicit \b could be useful
sometimes, but having the implicit \b enabled by default would be best.
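
The difference an explicit \b makes can be tried out quickly. The sketch
below uses Python's re module as a stand-in, so it says nothing about what
LanguageTool adds implicitly, and the handling of accented letters by \b may
differ in java.util.regex, which is not checked here. The helper fires() is
just an illustrative name.

```python
import re


def fires(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    cases = [
        (r"a nouveau", "C'est la nouveauté."),          # no boundaries
        (r"\ba nouveau\b", "C'est la nouveauté."),      # explicit boundaries
        (r"\ba nouveau\b", "Il recommence a nouveau."),
        (r"Elisabeth (Selber|Selberth)", "Elisabeth Selbert"),
        (r"Elisabeth (Selber|Selberth)\b", "Elisabeth Selbert"),
        (r"Elisabeth (Selber|Selberth)\b", "Elisabeth Selberth"),
    ]
    for pattern, text in cases:
        print(f"{pattern!r:40} {text!r:30} {fires(pattern, text)}")
```

Without \b, "a nouveau" also fires inside "la nouveauté", and the German
pattern without its trailing \b would fire on the prefix of "Elisabeth
Selbert". That is the kind of false alarm an implicit boundary would prevent.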

3) I see that the German grammar now uses <regexp> in many rules, even for
very simple patterns like:

  <rule id="ZU_LETZT" name="Zusammen-/Getrenntschreibung: zu letzt (zuletzt)">
    <regexp>zu letzt</regexp>

I wonder whether there is a performance impact. Here, the older way
of using <token> still seemed acceptable to me, and possibly faster
(no regexp). Keep in mind that regexp matching of long phrases can
be slow for some regexps. This depends on the regexp engine. A DFA
regexp engine should be O(n), where n is the length of the line, I
think, whereas backtracking NFA engines can be much, much slower.
But DFA engines are typically slower than NFA engines at compiling a
regexp, use more memory and have more limitations (often, no back
references).

See https://swtch.com/~rsc/regexp/regexp1.html

Java uses a backtracking NFA regexp engine, I think. So regexp matching
of long phrases runs the risk of being very slow in cases where there
is a lot of backtracking.

Vim was using a backtracking regexp engine until recently. Vim 7.4 added
a second, non-backtracking engine of the kind described in the article
above, which can handle most regexps, and it made a massive speed
improvement when doing pattern matching on long lines.
The speed improvement was visible to me when doing syntax
highlighting of the French grammar.xml in Vim, where there are long lines.

So I'm not sure about the performance of the new <regexp> feature.
At the very least, regexps have to be written carefully to avoid long
backtracking. This was less of a problem with tokenization, as we only
matched words, which are short.
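
As an illustration of the backtracking problem, here is a small timing
sketch. It uses Python's re module, which is also a backtracking engine, as
a stand-in; the numbers say nothing about Java or LanguageTool directly, and
the two patterns are textbook examples rather than anything taken from a
grammar.xml.

```python
import re
import time

# Nested quantifiers: a run of 'a' can be split between the inner and outer
# '+' in many ways, and a backtracking engine may try all of them on failure.
RISKY = re.compile(r"(a+)+$")
# Accepts the same strings when matched from the start, without the nesting.
SAFE = re.compile(r"a+$")


def timed_match(pattern, text):
    """Return (matched, seconds) for matching the pattern at the start of text."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 14, 18):
        text = "a" * n + "b"  # the trailing 'b' forces the match to fail
        _, risky_time = timed_match(RISKY, text)
        _, safe_time = timed_match(SAFE, text)
        print(f"n={n:2}  risky={risky_time:.6f}s  safe={safe_time:.6f}s")
```

On a backtracking engine the time for the nested pattern typically grows
very quickly with n, while the flat pattern stays fast. Rules that avoid
nested or overlapping quantifiers should be much less exposed to this.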

If <regexp> turns out to be comparable in speed to <token>, then it's OK,
and rules with <regexp> are slightly shorter even for simple patterns such
as the ZU_LETZT example.

Regards
Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
