Re: [Languagetool] Performance

Marcin Miłkowski Mon, 19 Nov 2012 01:40:17 -0800

On 2012-11-19 01:12, Dominique Pellé wrote:
> Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
>> On 2012-11-18 22:48, Daniel Naber wrote:
>>> On 18.11.2012, 21:33:32 Marcin Miłkowski wrote:
>>>
>>>> The easiest way to see whether this could speed things up at all is
>>>> change the line 162 in AbstractPatternRule.java to this one:
>>>>
>>>> final int numberOfReadings = 1;
>>>
>>> There's no significant difference. Even when I comment out the whole
>>> testAllReadings() method, the test gets only 20% faster, whereas the
>>> profiling data suggested it should be more like 60%. Maybe the profiling
>>> isn't as precise as I'd like it to be?
>>
>> Frankly, I have profiled the pattern rules many times and have not seen
>> much room for improvement. Caching previous matches (to answer
>> Dominique) turned out to be more expensive than checking one more time.
>
> OK.  My idea was not about caching though.  Probing a
> cache may indeed be as expensive as pattern matching,
> since each pattern match is cheap, but we do it many times.


Yeah, but to reuse the pattern, we would have to build a finite-state 
machine in memory (or on disk) first. This is far from trivial because 
we would have to flatly encode all features (token, pos, lemma) and make 
sure we still have Java Unicode regular expressions. It is, frankly, not 
obvious that this is even logically possible, since Java regular 
expressions are not just regular expressions but have additional 
features, which means that no simple FSA mapping is possible (as such, 
nondeterministic finite machines might help, but because of 
backreferences and similar, we have a fairly complex scenario).
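To illustrate the backreference point, here is a minimal sketch in plain Java (not LanguageTool code; the class and method names are made up for this example). The pattern below accepts only an immediately repeated word, and the set of strings of the form "w w" is not a regular language, so no finite-state automaton can encode it for arbitrary words.

```java
import java.util.regex.Pattern;

public class BackreferenceDemo {
    // (\p{L}+) captures a run of Unicode letters; \1 must then repeat
    // exactly the captured text. This is the feature that goes beyond
    // what a plain finite-state automaton can express.
    static final Pattern REPEATED = Pattern.compile("(\\p{L}+) \\1");

    // True only if the whole string is a word followed by the same word.
    static boolean isRepeatedWord(String s) {
        return REPEATED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedWord("da da")); // prints true
        System.out.println(isRepeatedWord("da ma")); // prints false
    }
}
```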

> My idea was about doing it fewer times, by organizing rules
> (at pre-processing time) in sets for example that share the
> same first token in their pattern. Hopefully, those sets have
> several elements on average for it to work well (I did not
> try to assess how many but I can try). So in each set, we
> check matching of the first token once (no cache probing).
> But I understand that it may be a lot of work to change the
> code.  I'll try to see at least how many rules would be in
> each set on average, in each language, to see how much
> potential the idea has.

OK, that's a neat idea.
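
A rough sketch of how such grouping might look, under strong simplifying assumptions: the Rule class here is a hypothetical stand-in (not LanguageTool's AbstractPatternRule), and patterns are plain literal tokens without regexps, POS tags, or inflection. With literal first tokens the group can be found by a hash lookup; for regexp first tokens one would instead test the shared first token once per group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstTokenIndex {
    // Hypothetical, simplified rule: an id plus a sequence of literal tokens.
    static final class Rule {
        final String id;
        final List<String> tokens;
        Rule(String id, String... tokens) {
            this.id = id;
            this.tokens = Arrays.asList(tokens);
        }
    }

    // Pre-processing step: group rules by the first token of their pattern.
    static Map<String, List<Rule>> index(List<Rule> rules) {
        Map<String, List<Rule>> byFirst = new HashMap<>();
        for (Rule r : rules) {
            byFirst.computeIfAbsent(r.tokens.get(0), k -> new ArrayList<>()).add(r);
        }
        return byFirst;
    }

    // At each sentence position, only rules whose first token equals the
    // current token are tried; every other rule is skipped entirely.
    static List<String> match(Map<String, List<Rule>> byFirst, List<String> sentence) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            List<Rule> candidates = byFirst.get(sentence.get(i));
            if (candidates == null) {
                continue;
            }
            for (Rule r : candidates) {
                int end = i + r.tokens.size();
                if (end <= sentence.size() && sentence.subList(i, end).equals(r.tokens)) {
                    hits.add(r.id + "@" + i);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(new Rule("DAM", "da", "ma"), new Rule("DAZ", "da", "da"));
        Map<String, List<Rule>> idx = index(rules);
        // Both rules share the first token "da", so they land in one group.
        System.out.println(match(idx, Arrays.asList("Lavaret", "em", "eus", "da", "ma", "zad")));
    }
}
```

How much this saves depends on how many rules share a first token, which is exactly the per-language statistic Dominique proposes to measure.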

> In Breton at least, I sometimes run into a combinatorial
> explosion of rules in order to implement what I want.  Of
> course, having many rules probably slows things down.

I don't think that having 16 extra rules changes much, as they usually
don't match anyway. That cannot be the cause of the slowdown.

> Another construct which would help to avoid explosion of
> number of rules is a way to be able to perform several
> substitutions.  Here is an example in Breton:
>
>       <rule id="DAM" name="da + ma = da’m">
>        <pattern>
>          <token>da</token>
>          <token>ma</token>
>        </pattern>
>        <message>Gwelloc’h eo skrivañ <suggestion>\1’m</suggestion>.</message>
>        <example type="incorrect">Lavaret em eus <marker>da ma</marker>
> zad.</example>
>        <example type="correct">Lavaret em eus da’m zad.</example>
>      </rule>
>
>      <rule id="DAZ" name="da + da = da’z">
>        <pattern>
>          <token>da</token>
>          <token>da</token>
>        </pattern>
>        <message>Gwelloc’h eo skrivañ <suggestion>\1’z</suggestion>.</message>
>        <example type="incorrect">Lavaret em eus <marker>da da</marker>
> dad.</example>
>        <example type="correct">Lavaret em eus da’z tad.</example>
>      </rule>
>
> Those 2 rules are almost the same. I wish I could write them in
> one single rule with the pattern....

Right. You are thinking of a conditional search-and-replace (if ma, then
’m; if da, then ’z). If it were ma -> ’m and da -> ’d, then you could
simply replace with ’$1, where you'd match ([dm]) in the regexp. Now, since you
have ’z as the second replacement, you can try another trick. Simply 
make two <match> elements: first for "ma", second for "da", and make 
sure they are exclusive. One will produce an empty string, and another 
the string you want. I did not test it, but the idea is simple enough. 
The only caveat is that I don't remember what <match> does by default if 
it produces an empty string via substitution. For some time, it did 
produce the original string in parentheses, but we can change it easily 
if it still does (I remember I changed some of this because of 
spell-checking).
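
The two ideas above can be sketched in plain Java. This only illustrates the logic; it is not how <match> is implemented, and the class and method names are invented for the example. In particular, it simply assumes that a non-matching substitution yields an empty string, which is the very behaviour of <match> I am unsure about.

```java
public class ConditionalSuggestion {
    // Simple case (ma -> ’m, da -> ’d): one regexp with a captured consonant.
    // \u2019 is the apostrophe ’ used in the Breton rules.
    static String simple(String second) {
        return second.replaceAll("^([dm])a$", "\u2019$1");
    }

    // One "exclusive" substitution: yields the replacement if the token is
    // exactly `from`, and an empty string otherwise (assumed behaviour).
    static String exclusive(String token, String from, String to) {
        return token.equals(from) ? to : "";
    }

    // Actual case (ma -> ’m, da -> ’z): two exclusive substitutions are
    // concatenated; at most one of them is non-empty.
    static String suggestion(String first, String second) {
        return first
            + exclusive(second, "ma", "\u2019m")
            + exclusive(second, "da", "\u2019z");
    }

    public static void main(String[] args) {
        System.out.println(suggestion("da", "ma")); // prints da’m
        System.out.println(suggestion("da", "da")); // prints da’z
    }
}
```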

Regards,
Marcin
