14 January 2012, 16:50, from Marcin Miłkowski
On 2012-01-14 00:14, Dominique Pellé wrote:
Dominique Pellé wrote:
Marco A.G.Pinto wrote:
Hello!
I don't know how it was coded, but I tried to run a script Daniel told me
about and it took a lot of time.
My suggestion is that, when LibreOffice starts, instead of reading from
the XML files, the contents would already be in files containing arrays
with the words.
Let me give you an example for the English XML file:
ARRAY:
Possible Typos
2
ARE_STILL_THE_SOME
IS_EVEN_WORST
Grammar
1
WANT_THAT_I
#END#
This is a very simple example of optimization. Each block has the
type of grammar error, then in the next position the number of entries
for it (converted from string to number), then the entries themselves.
The file ends with #END#.
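To make the proposed layout concrete, here is a minimal reader for it, written in Python as a sketch only. The function name parse_rule_arrays is hypothetical, and treating "ARRAY:" as an optional header line is an assumption drawn from the example above.

```python
def parse_rule_arrays(text):
    """Parse the flat layout sketched above into {category: [rule ids]}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pos = 0
    # Assumption: "ARRAY:" is just a header line and carries no data.
    if lines and lines[0] == "ARRAY:":
        pos = 1
    categories = {}
    while pos < len(lines) and lines[pos] != "#END#":
        if pos + 1 >= len(lines):
            raise ValueError("category name without an entry count")
        name = lines[pos]
        count = int(lines[pos + 1])  # the "convert string to number" step
        entries = lines[pos + 2 : pos + 2 + count]
        if len(entries) != count:
            raise ValueError(f"category {name!r} is truncated")
        categories[name] = entries
        pos += 2 + count
    return categories


SAMPLE = """ARRAY:
Possible Typos
2
ARE_STILL_THE_SOME
IS_EVEN_WORST
Grammar
1
WANT_THAT_I
#END#
"""

print(parse_rule_arrays(SAMPLE))
```

Note that the reader depends entirely on the counts being right: a wrong count silently shifts every later category.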
Hi Marco
I can't say that I like it. It would be messy. XML syntax can be
validated, for example. Anyway, I doubt that it would speed things up.
Before doing optimizations, it is always necessary to measure first;
otherwise you might try to optimize something that takes only 1%
of the time in the first place. Measuring also lets you verify objectively,
with numbers, that whatever you change actually speeds things up.
After reading your email, I was curious about the startup time (in
command line, not in LibreOffice), so I created a script to measure
it for all languages:
For each language, the script:
- counts the number of XML rules. I used the latest in SVN r6239
so it can differ slightly from the numbers at
http://www.languagetool.org/languages/
- measures startup time (3 times to avoid outliers) when
launching LanguageTool with an empty sentence.
Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh
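For readers who cannot fetch the shell script, the timing half of the procedure can be approximated in Python. This is a sketch under assumptions, not a port of the script: the function name time_startup is hypothetical, and the command passed in the demo is a placeholder (the Python interpreter itself), since the exact LanguageTool command line is not given in this message.

```python
import subprocess
import sys
import time


def time_startup(command, samples=3):
    """Run `command` `samples` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            command,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command: replace with the real LanguageTool invocation.
    placeholder = [sys.executable, "-c", "pass"]
    print(" ".join(f"{d:.2f}" for d in time_startup(placeholder)))
```

Taking several samples per language, as the script does, makes a single slow run easy to spot.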
Here is the result:
$ cd languagetool
$ ./startup-time-lt.sh
lang | #rules | startup time in sec (3 samples)
-----+--------+--------------------------------
ast  |     61 | 0.29 0.26 0.27
br   |    353 | 0.41 0.39 0.40
ca   |    214 | 0.83 0.83 0.83
zh   |    328 | 2.34 2.30 2.28
da   |     22 | 0.82 0.80 0.79
nl   |    336 | 0.88 0.88 0.89
en   |    789 | 0.96 0.95 0.96
eo   |    262 | 0.84 0.85 0.84
fr   |   2015 | 0.51 0.52 0.51
gl   |    157 | 0.96 0.97 0.97
de   |    717 | 1.74 1.91 1.76
is   |     39 | 0.77 0.82 0.81
it   |     94 | 0.28 0.28 0.28
km   |     24 | 0.88 0.84 0.83
lt   |      6 | 0.20 0.21 0.22
ml   |     23 | 0.81 0.78 0.81
pl   |   1029 | 1.17 1.17 1.16
ro   |    452 | 0.99 0.98 0.94
ru   |    149 | 0.95 0.91 0.92
sk   |     58 | 1.00 0.95 0.93
sl   |     86 | 0.80 0.86 0.83
es   |     70 | 0.91 0.85 0.84
sv   |     26 | 0.29 0.26 0.27
tl   |     44 | 0.25 0.26 0.25
uk   |     12 | 1.80 1.90 1.84
This was measured on a 5 year old laptop
(Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz)
What's interesting here is that the startup time
does not strongly depend on the number of XML
rules. Ukrainian (uk) has only 12 XML rules, yet it
is 3.52 times slower than French (fr) which has
2015 rules!
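The comparison can be redone from the table with a few lines of Python. This is only a sketch: the five rows are copied from the results above, the helper names are hypothetical, and it uses the mean of the three samples, which gives a uk/fr ratio near 3.6. The exact figure depends on which samples are compared.

```python
from statistics import mean

# (rule count, three startup samples in seconds), copied from the table.
RESULTS = {
    "fr": (2015, [0.51, 0.52, 0.51]),
    "pl": (1029, [1.17, 1.17, 1.16]),
    "en": (789, [0.96, 0.95, 0.96]),
    "uk": (12, [1.80, 1.90, 1.84]),
    "lt": (6, [0.20, 0.21, 0.22]),
}


def mean_startup(lang):
    """Mean of the three startup samples for one language."""
    return mean(RESULTS[lang][1])


def slowdown(slow, fast):
    """How many times slower `slow` starts than `fast`, using mean times."""
    return mean_startup(slow) / mean_startup(fast)


# List languages by rule count to show that time does not follow it.
for lang, (rules, samples) in sorted(RESULTS.items(), key=lambda kv: kv[1][0]):
    print(f"{lang:>3} {rules:>5} rules  {mean(samples):.2f} s")
print(f"uk / fr = {slowdown('uk', 'fr'):.2f}")
```

Sorted by rule count, lt (6 rules) and uk (12 rules) sit next to each other yet differ by almost a factor of nine in startup time.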
The slowest startup times are for Chinese, German
and Ukrainian. It would be interesting to find out why.
Regards
-- Dominique
I spent a bit of time finding out why the
startup time is slow for Ukrainian (uk)
(~1.80 sec) even though it has only 12 XML rules:
uk | 12 | 1.80 1.90 1.84
First, I disabled the SRX tokenizer by
commenting out getSentenceTokenizer()
in src/java/org/languagetool/language/Ukrainian.java
This saves about half a second, bringing
the startup time to:
uk | 12 | 1.31 1.31 1.29
Disabling the SRX tokenizer in Esperanto.java
also saved half a second at startup for Esperanto.
I find it odd that the SRX file src/resource/segment.srx
contains the rules for *all* languages. Wouldn't
it make more sense to have a smaller SRX file
per language? The function that loads it
(SRXSentenceTokenizer in
src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java)
knows the language.
But it would be against the SRX standard, which uses a hierarchy of
processing. Note that for all languages, we have a few common rules
(paragraph split etc.) and others could fall back on English etc. So
splitting is not actually a problem, it is a feature of SRX to have
these things together.
Anyway, the SRX tokenizer uses regular expressions, and if they are
monsters (too many disjunctions), it could be slow. Probably optimizing
the regexps in SRX will result in more saving than splitting and reading
separately. Note that for other SRX languages, we don't have too much of
an overhead.
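As an illustration of what "optimizing the regexps" could mean, here is a small Python sketch that rewrites a flat disjunction into a factored pattern with shared prefixes and checks that both match the same text. The abbreviation list is hypothetical and not taken from segment.srx, and Python's re module is only a stand-in for the Java regex engine LanguageTool uses, so any timing it prints says nothing definite about LanguageTool itself.

```python
import re
import timeit

# Hypothetical abbreviations; the real SRX rules are not shown in this thread.
ABBREVIATIONS = ["Mr", "Mrs", "Ms", "Dr", "Prof", "Prov", "Pres"]

# Flat form: one alternative per abbreviation.
naive = re.compile(r"\b(?:" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.")

# Hand-factored form: alternatives sharing a prefix are merged.
factored = re.compile(r"\b(?:M(?:rs?|s)|Dr|Pr(?:o[fv]|es))\.")

TEXT = "Dr. Smith met Mrs. Jones and Prof. Lee. Mr. Ms. Pres. Prov. Drs. Mister."


def compare(repeat=1000):
    """Time both patterns on TEXT; returns (naive_seconds, factored_seconds)."""
    t_naive = timeit.timeit(lambda: naive.findall(TEXT), number=repeat)
    t_factored = timeit.timeit(lambda: factored.findall(TEXT), number=repeat)
    return t_naive, t_factored


if __name__ == "__main__":
    # Both patterns must agree before any timing is meaningful.
    assert naive.findall(TEXT) == factored.findall(TEXT)
    print("naive %.4f s, factored %.4f s" % compare())
```

Whether the factored form is actually faster has to be measured on the real rules and the real engine, in line with the advice earlier in the thread.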
I will try to rewrite the Ukrainian section in the SRX file to remove
some regular expressions.