Daniel Naber <list2...@danielnaber.de> wrote:

> On 16.11.2012, 15:24:16 Marcin Miłkowski wrote:
>
>> Hm, just a stupid question, but did you remember that loading rules is
>> also quite expensive? We have to parse XML and compile regexes, and read
>> from disk.
>
> Loading rules takes about 130ms once the system is warmed up. That's not
> fast, but still not much compared to the 2 seconds checking takes in total.
>
> Here's what I do: I start HTTPSServer locally, then I start
> HTTPSServerTesting with only 1 thread and the language set to German (to
> make things less complex). The result of profiling is attached, it's a
> screenshot of visualvm.
>
> Regards
> Daniel
Hi Daniel,

One thing that I just noticed is that German has rather verbose POS tags,
using substrings like "SUB" (for Substantiv), "AKK" (for Akkusativ), etc.
This may be hurting performance. Using single letters for those things not
only makes matching POS tags faster but, perhaps more importantly, allows
writing more efficient regexps. For example, what is currently written as:

    postag="(PA2|SUB|ART|ADJ):.+"

could be written as:

    postag="[PSAJ]:.+"

assuming that:

    P == PA2
    S == SUB
    A == ART
    J == ADJ

As you can see, the regexp alternation | (which is expensive) can be
replaced with the cheaper [...], and the capturing parentheses are removed
as well (capturing may prevent optimisations in some regexp engines).

Well, of course it would be a lot of work to change the POS tags, and it
may harm readability. But I found that single letters work well for the
languages that I maintain.

Something easier that you could change are things like this:

    postag="VER:(1|2|3):.*"  ---->  postag="VER:[123]:.*"

I also see regexps that look wrong. This one, for example:

    <token regexp="yes">heilige(n)</token>

The capturing group here is useless. I suspect that what was meant was to
make the n optional, as in:

    <token regexp="yes">heiligen?</token>

The same error happens in several other places, for example:

    <token regexp="yes">(k)eine|diese|(&meindein;)e|nur|bloß</token>

... which should probably be:

    <token regexp="yes">k?eine|diese|(&meindein;)e|nur|bloß</token>

Regards
-- Dominique
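A minimal sketch of how the rewrites above could be sanity-checked before touching the rule files. It uses Python's `re` module as a stand-in for Java's regex engine, which is an assumption; for patterns this simple the two should behave the same. It also assumes tokens and POS tags are matched as a whole (full-match semantics). The sample tags such as `VER:1:SIN` and the helper `matches` are made up for illustration and are not taken from the German tag set.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern (assumed full-match semantics)."""
    return re.fullmatch(pattern, text) is not None


# Case 1: alternation of single characters vs. a character class.
# The two forms should accept exactly the same strings.
sample_tags = ["VER:1:SIN", "VER:2:PLU", "VER:3:", "VER:4:SIN", "SUB:NOM"]  # hypothetical tags
same = all(
    matches(r"VER:(1|2|3):.*", tag) == matches(r"VER:[123]:.*", tag)
    for tag in sample_tags
)

# Case 2: "(n)" is a mandatory, captured n; "n?" makes the n optional.
# These two forms are NOT equivalent, so this rewrite changes what the rule matches.
words = ["heilige", "heiligen"]
group_form = [w for w in words if matches(r"heilige(n)", w)]
optional_form = [w for w in words if matches(r"heiligen?", w)]

print(same)           # True
print(group_form)     # ['heiligen']
print(optional_form)  # ['heilige', 'heiligen']
```

The first case is a pure optimisation and can be applied mechanically. The second changes behaviour, so each such rule would need a look to confirm that the optional form is really what was intended.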