Daniel Naber <list2...@danielnaber.de> wrote:

> On 16.11.2012, 15:24:16 Marcin Miłkowski wrote:
>
>> Hm, just a stupid question, but did you remember that loading rules is
>> also quite expensive? We have to parse XML and compile regexes, and read
>> from disk.
>
> Loading rules takes about 130ms once the system is warmed up. That's not
> fast, but still not much compared to the 2 seconds checking takes in total.
>
> Here's what I do: I start HTTPSServer locally, then I start
> HTTPSServerTesting with only 1 thread and the language set to German (to
> make things less complex). The result of profiling is attached, it's a
> screenshot of visualvm.
>
> Regards
>  Daniel

Hi Daniel

One thing I just noticed is that German has rather
verbose POS tags, built from substrings like "SUB" (for Substantiv),
"AKK" (for Akkusativ), etc.  This may be hurting performance.
Using single letters for those things not only makes matching
POS tags faster but, perhaps more importantly, allows writing
more efficient regexps.

For example, what is currently written as:

   postag="(PA2|SUB|ART|ADJ):.+"

Could be written as:

   postag="[PSAJ]:.+"

Assuming that:
P == PA2
S == SUB
A == ART
J == ADJ

As you can see, the regexp alternation |  (which is expensive)
can be replaced with the cheaper [...], and the capturing parentheses
are removed as well  (capturing may prevent optimisations in some
regexp engines).
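
To make the rewrite above concrete, here is a small sketch that checks the two patterns accept the same tags once the prefixes are renamed. It uses Python's re module rather than the Java regexp engine LanguageTool runs on, the SHORT mapping is just the hypothetical one from this mail, and the sample tags are made up for illustration. It only shows equivalence, it does not measure speed.

```python
import re

# Hypothetical renaming of the verbose German tag prefixes (from this mail).
SHORT = {"PA2": "P", "SUB": "S", "ART": "A", "ADJ": "J"}

verbose = re.compile(r"(PA2|SUB|ART|ADJ):.+")  # alternation + capturing group
compact = re.compile(r"[PSAJ]:.+")             # character class, no group


def shorten(tag):
    """Replace the leading tag prefix by its single letter, if it has one."""
    head, sep, rest = tag.partition(":")
    return SHORT.get(head, head) + sep + rest


def agree(tags):
    """True if both patterns accept/reject every tag in the same way."""
    return all(
        bool(verbose.fullmatch(t)) == bool(compact.fullmatch(shorten(t)))
        for t in tags
    )


# Illustrative tags only, not taken from the real German tagger dictionary.
samples = ["SUB:AKK:SIN:FEM", "ADJ:NOM:PLU", "VER:3:SIN", "ART:DEF", "PA2"]
print(agree(samples))
```

The whole tag is matched here (fullmatch), which is an assumption about how postag patterns are applied.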

Of course, it would be a lot of work to change the POS
tags, and it may harm readability.  But I have found that single
letters work well for the languages that I maintain.

Something easier to change would be things like this:

postag="VER:(1|2|3):.*"  ---->  postag="VER:[123]:.*"
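
A quick way to convince yourself that this change is safe is to run both forms over a handful of tags. The sketch below does that with Python's re module (not the Java engine LanguageTool uses), and the tags are invented examples rather than real tagger output.

```python
import re

with_group = re.compile(r"VER:(1|2|3):.*")  # current form
with_class = re.compile(r"VER:[123]:.*")    # proposed form

# Invented sample tags for illustration.
tags = ["VER:1:SIN:PRÄ", "VER:3:PLU", "VER:INF", "VER:2:", "SUB:NOM"]

# For each tag, do both patterns give the same accept/reject answer?
same = [
    bool(with_group.fullmatch(t)) == bool(with_class.fullmatch(t))
    for t in tags
]
print(all(same))

# The proposed form also carries no capturing group.
print(with_group.groups, with_class.groups)
```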

I also see regexps that look wrong. This one, for example:

<token regexp="yes">heilige(n)</token>

The capturing group here has no effect: the pattern still
requires the n.  I suspect the intent was to make the n optional, as in:

<token regexp="yes">heiligen?</token>

The same error happens in several other places, for example:

<token regexp="yes">(k)eine|diese|(&meindein;)e|nur|bloß</token>

... should probably be:

<token regexp="yes">k?eine|diese|(&meindein;)e|nur|bloß</token>
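
The difference between the grouped and the optional forms is easy to see in a small sketch. It assumes the token pattern has to match the whole token, uses Python's re module instead of the Java engine, and leaves out the (&meindein;)e alternative because the entity's expansion is not shown here.

```python
import re


def matches(pattern, token):
    """True if the pattern matches the whole token."""
    return re.fullmatch(pattern, token) is not None


# Parentheses only group: "heilige(n)" still demands the final n.
print(matches(r"heilige(n)", "heiligen"))  # True
print(matches(r"heilige(n)", "heilige"))   # False
print(matches(r"heiligen?", "heilige"))    # True

# Same story for (k)eine versus k?eine.
current = r"(k)eine|diese|nur|bloß"
proposed = r"k?eine|diese|nur|bloß"
print(matches(current, "eine"))    # False
print(matches(proposed, "eine"))   # True
print(matches(proposed, "keine"))  # True
```

If the rules were really meant to match only "heiligen" and "keine", the parentheses can simply be dropped instead.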

Regards
-- Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel