Daniel Naber <daniel.na...@languagetool.org> wrote:

> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
>   <pattern><token>foo</token></pattern>
>
> you can now use
>
>   <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack, the new rules need to be tested and adapted manually. It also only
> touches rules without '<marker>' elements. There's no <marker> for
> regexp, it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp etc.
>
> Please let me know how this works for you.
>
> Regards
> Daniel

Hi Daniel

First of all, thanks for implementing it. But I have questions
or remarks :-)

To me, <regexp> is useful when we can replace many pattern rules
with a single one, which reduces the number of rules and improves
their maintainability.

I see this example in German:

  <rulegroup id="GIRLS_DAY" name="Eigenname: 'Girl's (Girls’) Day'">
    <short>&eigenname;</short>
    <rule>
      <regexp>(girl|boy)['’`´‘]s day</regexp>
      <message>Meinten Sie den Aktionstag <suggestion>\1s’ Day</suggestion>?</message>
      <example correction="Girls’ Day">Der <marker>Girl's Day</marker> findet einmal im Jahr statt.</example>
      <example correction="Boys’ Day">Der <marker>Boy`s Day</marker> ist ein Aktionstag.</example>
    </rule>
    <rule>
      <regexp>(girls|boys)[´`‘] day</regexp>
      <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
      <short>&eigenname;</short>
      <example>Der <marker>Boys’ Day</marker> findet einmal im Jahr statt.</example>
      <example correction="Boys’ Day">Der <marker>Boys` Day</marker> ist ein Aktionstag.</example>
    </rule>
    <rule>
      <regexp>(girls|boys) day</regexp>
      <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
      <example correction="Girls’ Day">Der <marker>Girls Day</marker> findet einmal im Jahr statt.</example>
      <example correction="Boys’ Day">Der <marker>Boys Day</marker> ist ein Aktionstag.</example>
    </rule>
    <rule>
      <regexp>(girls|boys)['’`´‘]day</regexp>
      <message>Meinten Sie den Aktionstag <suggestion>\1’ Day</suggestion>?</message>
      <example correction="Girls’ Day">Der <marker>Girls'Day</marker> findet einmal im Jahr statt.</example>
      <example correction="Boys’ Day">Der <marker>Boys`Day</marker> ist ein Aktionstag.</example>
    </rule>
  </rulegroup>

The idea of <regexp> is that it should now be possible to have a
single rule instead of many rules, using something more or less
like this:

  <rule>
    <regexp>(girl|boy)s?[´`‘]?s? day</regexp>
    ...
  </rule>

I have not used <regexp> yet, and I have questions before I use it.
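
To check what such a merged rule would and would not catch, the idea can
be tried outside LanguageTool first. The following is only a rough
sketch in Python, whose re module is assumed to behave like Java's regex
engine for these simple constructs. PATTERN is an assumed, slightly wider
variant of the regexp above (it also accepts ' and ’ and a missing
space), and suggest() is a hypothetical helper, not anything that exists
in LanguageTool.

```python
import re

# Assumed consolidated pattern (not taken from the German grammar.xml):
# optional "s", optional apostrophe-like character, optional "s",
# optional space, then "day". Case-insensitive, like <regexp> by default.
PATTERN = re.compile(r"\b(girl|boy)s?['’`´‘]?s? ?day\b", re.IGNORECASE)


def suggest(text):
    """Return (matched_text, suggestion) pairs for misspelled variants."""
    results = []
    for m in PATTERN.finditer(text):
        # m.group(1) plays the role of \1 in the rule's <suggestion>.
        suggestion = m.group(1) + "s’ Day"
        # The merged pattern also matches the correct spelling, so skip it.
        if m.group(0) != suggestion:
            results.append((m.group(0), suggestion))
    return results


for sentence in (
    "Der Girl's Day findet einmal im Jahr statt.",
    "Der Boy`s Day ist ein Aktionstag.",
    "Der Girls'Day findet einmal im Jahr statt.",
    "Der Boys’ Day findet einmal im Jahr statt.",
):
    print(sentence, "->", suggest(sentence))
```

Note that a pattern this loose also matches forms that none of the four
original rules cover, for example "Girl Day", so a merged rule would
still need the same manual testing as the converted ones.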

1) How do I highlight only a subset of the match?

Trying the above rule, I see this:

  Line 1, column 8, Rule ID: GIRLS_DAY[1]
  Message: Meinten Sie den Aktionstag 'girls’ Day'?
  Suggestion: girls’ Day
  It's a girl's day.
         ^^^^^^^^^^

But what if I wanted to highlight only the word girl? Maybe
highlighting the full pattern is OK in the above example, but where
I'd like to use <regexp>, I do not want to highlight the full pattern
but only part of it, possibly a single word. For example, in these
French expressions...

  a nouveau      -> à nouveau
  a plein temps  -> à plein temps
  a rude épreuve -> à rude épreuve
  a vol d'oiseau -> à vol d'oiseau
  ... (etc, more cases in reality...)

I'm thinking of creating such a rule:

  <regexp>a (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... but how do I say that only the word "a" should be
highlighted/underlined? I don't see the equivalent of
<marker>...</marker>. How about something like this?

  <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... where the marker="1" attribute indicates that captured group #1
of the regexp should be underlined, i.e. the word "a" in the above
example? What to underline could even be a portion of a word.

2) Is there always an implicit word boundary at the beginning or end
of <regexp>?

In the German grammar, I see this for example:

  <regexp case_sensitive="yes">Elisabeth (Selber|Selberth)\b</regexp>

Since I see \b at the end, it suggests that there is no implicit \b.
But I don't see it at the beginning, so I wonder whether \b is needed
or not. Having an attribute on <regexp> to disable an implicit \b
could be useful sometimes. Enabling it by default would be best.

3) I see that the German grammar now uses <regexp> in many rules,
even for very simple patterns like:

  <rule id="ZU_LETZT" name="Zusammen-/Getrenntschreibung: zu letzt (zuletzt)">
    <regexp>zu letzt</regexp>

I wonder whether there is a performance impact.
Here, the older way of using <token> still seems acceptable to me,
and possibly faster (no regexp).

Keep in mind that matching long phrases can be slow for some regexps,
depending on the regexp engine. An automaton-based (DFA) engine should
be O(n), where n is the length of the line, I think, whereas
backtracking (NFA) engines can be much, much slower. But DFA engines
are typically slower to compile a regexp, use more memory and have
more limitations (often, no back references). See
https://swtch.com/~rsc/regexp/regexp1.html

Java uses a backtracking (NFA) regexp engine, I think. So matching
long phrases runs the risk of being very slow in cases where there is
a lot of backtracking. Vim used only a backtracking engine until
recently; Vim 7.4 added a second, automaton-based engine that handles
most regexps without backtracking, and it made a massive speed
improvement when doing pattern matching on long lines. The speed
improvement was visible to me when doing syntax highlighting of the
French grammar.xml in Vim, where there are long lines.

So I'm not sure about the performance of the new <regexp> feature. At
the least, regexps have to be written carefully to avoid long
backtracking. This was less of a problem with tokenization, as we only
matched words, which are short.

If <regexp> turns out to be comparable in speed to <token>, then it's
OK, and rules with <regexp> are slightly shorter even for simple
patterns such as the ZU_LETZT example.
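
The three questions above can be tried out in a few lines. This is a
sketch only, written with Python's re module, which is a backtracking
engine and is assumed here to behave like Java's for \b and capture
groups. marked_span() and its marker parameter are hypothetical
stand-ins for the proposed marker="1" attribute, and (a+)+$ is a
textbook worst case, not a pattern from any grammar.xml.

```python
import re
import time


def marked_span(pattern, text, marker=0):
    """Return (start, end) of capture group `marker` (0 = whole match), or None."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.span(marker) if m else None


# Question 1: underline only capture group 1 (the word "a").
FRENCH = r"\b(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)\b"
sentence = "Il travaille a plein temps."
start, end = marked_span(FRENCH, sentence, marker=1)
print(sentence)
print(" " * start + "^" * (end - start))

# Question 2: without an explicit \b, the regexp matches inside other words.
print(marked_span(r"a nouveau", "la nouveauté"))      # a span: false alarm
print(marked_span(r"\ba nouveau\b", "la nouveauté"))  # None


# Question 3: time a failing match of a nested quantifier, which forces
# a backtracking engine to try every way of splitting the run of "a".
def seconds_to_reject(n):
    subject = "a" * n + "b"
    begin = time.perf_counter()
    re.match(r"(a+)+$", subject)
    return time.perf_counter() - begin


for n in (10, 14, 18):
    print(n, round(seconds_to_reject(n), 4))
```

The timings depend on the machine, but they should grow much faster
than the input length, which is the kind of behaviour a carelessly
written <regexp> could trigger on a long sentence.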

Regards
Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel