On 2012-11-19 01:12, Dominique Pellé wrote:
> Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
>> On 2012-11-18 22:48, Daniel Naber wrote:
>>> On 18.11.2012, 21:33:32 Marcin Miłkowski wrote:
>>>
>>>> The easiest way to see whether this could speed things up at all is
>>>> to change line 162 in AbstractPatternRule.java to this one:
>>>>
>>>> final int numberOfReadings = 1;
>>>
>>> There's no significant difference. Even when I comment out the whole
>>> testAllReadings() method, the test gets only 20% faster, whereas the
>>> profiling data suggested it should be more like 60%. Maybe the profiling
>>> isn't as precise as I'd like it to be?
>>
>> Frankly, I have profiled the pattern rules many times and have not seen
>> much room for improvement. Caching previous matches (to answer
>> Dominique) turned out to be more expensive than checking one more time.
>
> OK. My idea was not about caching though. Probing a
> cache may indeed be as expensive as pattern matching,
> since each pattern match is cheap, but we do it many times.

Yeah, but to reuse the pattern, we would first have to build a finite-state
machine in memory (or on disk). This is far from trivial, because we would
have to flatly encode all features (token, POS, lemma) and make sure we
still support Java Unicode regular expressions. Frankly, it is not obvious
that this is even logically possible: Java regular expressions are not
strictly regular but have additional features, so no simple FSA mapping
exists (nondeterministic finite-state machines might help, but backreferences
and similar constructs make the scenario fairly complex).

> My idea was about doing it fewer times, by organizing rules
> (at pre-processing time) in sets, for example sets that share the
> same first token in their pattern. Hopefully, those sets have
> several elements on average for it to work well (I did not
> try to assess how many but I can try). So in each set, we
> check matching of the first token once (no cache probing).
> But I understand that it may be a lot of work to change the
> code. I'll try to see at least how many rules would be in
> each set on average, in each language, to see how much
> potential the idea has.

OK, that's a neat idea.

> In Breton at least, I sometimes experience a combinatorial
> explosion of rules in order to implement what I want. Of
> course many rules probably slow things down.

I don't think an extra 16 rules change much, as they usually don't match
anyway. The slowdown cannot be due to this.

> Another construct which would help to avoid an explosion of the
> number of rules is a way to be able to perform several
> substitutions.
> Here is an example in Breton:
>
> <rule id="DAM" name="da + ma = da’m">
>   <pattern>
>     <token>da</token>
>     <token>ma</token>
>   </pattern>
>   <message>Gwelloc’h eo skrivañ <suggestion>\1’m</suggestion>.</message>
>   <example type="incorrect">Lavaret em eus <marker>da ma</marker> zad.</example>
>   <example type="correct">Lavaret em eus da’m zad.</example>
> </rule>
>
> <rule id="DAZ" name="da + da = da’z">
>   <pattern>
>     <token>da</token>
>     <token>da</token>
>   </pattern>
>   <message>Gwelloc’h eo skrivañ <suggestion>\1’z</suggestion>.</message>
>   <example type="incorrect">Lavaret em eus <marker>da da</marker> dad.</example>
>   <example type="correct">Lavaret em eus da’z tad.</example>
> </rule>
>
> Those 2 rules are almost the same. I wish I could write them in
> one single rule with the pattern....

Right. You are thinking of a conditional search and replace (if ma, then ’m;
if da, then ’z). If it were ma -> ’m and da -> ’d, then you could simply
replace with ’$1, where you'd match ([dm]) in the regexp. Since you have ’z
as the second replacement, you can try another trick. Simply make two
<match> elements, the first for "ma" and the second for "da", and make sure
they are mutually exclusive. One will produce an empty string, and the other
the string you want. I did not test it, but the idea is simple enough.

The only caveat is that I don't remember what <match> does by default if a
substitution produces an empty string. For some time, it produced the
original string in parentheses, but we can easily change that if it still
does (I remember changing some of this because of spell-checking).

Regards,
Marcin
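
P.S. To make the "two exclusive substitutions" trick concrete, here is a
minimal sketch in plain Java using java.util.regex. It is not LanguageTool's
<match> implementation. The class name ExclusiveSubstitution and the helpers
substituteOrEmpty and suggestion are hypothetical, and it assumes that a
non-matching substitution yields an empty string, which is exactly the
behaviour I am unsure about above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in LanguageTool.
class ExclusiveSubstitution {

    // Assumption: a substitution whose regex does not match the whole token
    // contributes an empty string instead of the original token.
    static String substituteOrEmpty(String token, String regex, String replacement) {
        return Pattern.matches(regex, token) ? token.replaceAll(regex, replacement) : "";
    }

    // Builds the suggestion for a "da" + second-token pair. The two patterns
    // are mutually exclusive, so at most one of them produces output.
    static String suggestion(String first, String second) {
        String forMa = substituteOrEmpty(second, "ma", "\u2019m"); // ma -> ’m
        String forDa = substituteOrEmpty(second, "da", "\u2019z"); // da -> ’z
        return first + forMa + forDa;
    }

    public static void main(String[] args) {
        System.out.println(suggestion("da", "ma")); // da’m
        System.out.println(suggestion("da", "da")); // da’z
        // The simpler case where one backreference is enough (ma -> ’m, da -> ’d):
        System.out.println("ma".replaceAll("([dm])a", "\u2019$1")); // ’m
    }
}
```

If <match> still returns the original string in parentheses for an empty
result, the concatenation above would not work as shown and the default
would have to be changed first.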