Dominique Pellé wrote:
> Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
>> On 2012-11-18 22:48, Daniel Naber wrote:
>>> On 18.11.2012, 21:33:32 Marcin Miłkowski wrote:
>>>
>>>> The easiest way to see whether this could speed things up at all is
>>>> to change line 162 in AbstractPatternRule.java to this:
>>>>
>>>> final int numberOfReadings = 1;
>>>
>>> There's no significant difference. Even when I comment out the whole
>>> testAllReadings() method, the test gets only 20% faster, whereas the
>>> profiling data suggested it should be more like 60%. Maybe the profiling
>>> isn't as precise as I'd like it to be?
>>
>> Frankly, I did profile the pattern rules many times and I haven't seen
>> much room for improvement. Caching previous matches (to answer
>> Dominique) turned out to be more expensive than checking one more time.
>
> OK. My idea was not about caching though. Probing a
> cache may indeed be as expensive as pattern matching,
> since each pattern match is cheap, but we do it many times.
>
> My idea was about doing it fewer times, by organizing rules
> (at pre-processing time) into sets that share the same first
> token in their pattern. Hopefully, those sets have several
> elements on average for it to work well (I did not try to
> assess how many, but I can try). So in each set, we check
> matching of the first token only once (no cache probing).
> But I understand that it may be a lot of work to change the
> code. I'll try to see at least how many rules would be in
> each set on average, in each language, to see how much
> potential the idea has.
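To make the grouping idea more concrete, here is a minimal, hypothetical
sketch (the class and variable names are made up, this is not actual
LanguageTool code): rules whose patterns start with the same first-token
condition are grouped at load time, and that shared condition is then
evaluated once per group and sentence position instead of once per rule.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class FirstTokenGroupingSketch {
  public static void main(String[] args) {
    // Groups keyed by the shared first-token condition; values are rule ids.
    // A real implementation would derive the key from the <token> element
    // (text, postag, regexp flags, case sensitivity, ...) while loading
    // grammar.xml, so only truly equivalent first tokens share a group.
    Map<Predicate<String>, List<String>> groups = new LinkedHashMap<>();
    groups.put(w -> w.equalsIgnoreCase("the"), List.of("RULE_A", "RULE_B"));
    groups.put(w -> w.matches("[0-9]+"), List.of("RULE_C"));

    String[] sentence = {"The", "cat", "has", "2", "lives"};
    for (int i = 0; i < sentence.length; i++) {
      for (Map.Entry<Predicate<String>, List<String>> e : groups.entrySet()) {
        if (e.getKey().test(sentence[i])) {  // first-token check: once per group
          for (String ruleId : e.getValue()) {
            // ...continue matching the rest of ruleId's pattern from i+1...
            System.out.println("position " + i + ": candidate rule " + ruleId);
          }
        }
      }
    }
  }
}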
I've just written a crude Perl script to assess how many
rules share the same first token on average. It's not as
high as I hoped, so optimizing this is not worthwhile if it
adds complexity.
Here are the numbers for a few languages (those for which
development is most active):
=== Checking br/grammar.xml ===
Number of patterns ......................... 401
Distinct number of first tokens ............ 232
Average #patterns sharing same 1st token ... 1.728
=== Checking ca/grammar.xml ===
Number of patterns ......................... 845
Distinct number of first tokens ............ 531
Average #patterns sharing same 1st token ... 1.591
=== Checking de/grammar.xml ===
Number of patterns ......................... 1339
Distinct number of first tokens ............ 871
Average #patterns sharing same 1st token ... 1.537
=== Checking en/grammar.xml ===
Number of patterns ......................... 769
Distinct number of first tokens ............ 501
Average #patterns sharing same 1st token ... 1.535
=== Checking eo/grammar.xml ===
Number of patterns ......................... 228
Distinct number of first tokens ............ 194
Average #patterns sharing same 1st token ... 1.175
=== Checking pl/grammar.xml ===
Number of patterns ......................... 783
Distinct number of first tokens ............ 530
Average #patterns sharing same 1st token ... 1.477
=== Checking pt/grammar.xml ===
Number of patterns ......................... 137
Distinct number of first tokens ............ 104
Average #patterns sharing same 1st token ... 1.317
=== Checking ru/grammar.xml ===
Number of patterns ......................... 121
Distinct number of first tokens ............ 90
Average #patterns sharing same 1st token ... 1.344
=== Checking zh/grammar.xml ===
Number of patterns ......................... 397
Distinct number of first tokens ............ 333
Average #patterns sharing same 1st token ... 1.192
This means that for German, for example, since on average
1.537 rules have the same first token, even if we optimised
so that the common first-token check is done only once per
group, the first-token matching would be sped up by at most
a factor of x1.537 (1339/871), which is not much if it
requires major changes. The language that would benefit the
most is Breton, which would be sped up by at most x1.728.
The script is crude (it does not take into account
comments in grammar.xml, and it does not detect that
<token postag="N.*" postag_regexp="yes"/> is the
same as <token postag_regexp="yes" postag="N.*"/>, etc.),
but the numbers should nevertheless be close to reality.
Here is the script used to find the numbers above (for reference):
$ cat first_tok.pl
#!/usr/bin/perl -w
use strict;

my $in_pattern = 0;      # true right after a <pattern ...> line
my $pattern_opt = '';    # attributes of the enclosing <pattern> element
my $first_token_count = 0;
my %count_first_token;   # "<pattern attrs>:<first token>" -> count

while (<>) {
  if (/<pattern(.*)>/) {
    # Remember the pattern attributes; the next line should hold the first token.
    $in_pattern = 1;
    $pattern_opt = $1;
  } else {
    if ($in_pattern && (m{<token[^>]*>.*</token>} || m{<token[^>]+/>})) {
      # Count this first token, qualified by the pattern attributes.
      ++$count_first_token{"$pattern_opt:$&"};
      ++$first_token_count;
    }
    $in_pattern = 0;
  }
}

my $distinct_first_token_count = scalar(keys %count_first_token);
printf "Number of patterns ......................... %d\n",
       $first_token_count;
printf "Distinct number of first tokens ............ %d\n",
       $distinct_first_token_count;
printf "Average #patterns sharing same 1st token ... %.3f\n",
       $first_token_count / $distinct_first_token_count;
And you can run it as follows:
$ ./first_tok.pl de/grammar.xml
Number of patterns ......................... 1339
Distinct number of first tokens ............ 871
Average #patterns sharing same 1st token ... 1.537
Regards
-- Dominique