Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Jan Schreiber
Dominique Pellé wrote:
 The slowest startup times are for Chinese, German
 and Ukrainian. It would be interesting to find out why.

That's true.

Some of my rules in the German file check against huge regular
expressions, which is probably quite inelegant. This might be one reason.



Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Marcin Miłkowski
On 2012-01-14 01:39, Jimmy O'Regan wrote:
 2012/1/13 Dominique Pellé <dominique.pe...@gmail.com>:
 I see that the Ukrainian tagger uses MySpell
 (the UkrainianMyspellTagger class), which reads and
 parses a text file dist/resource/uk/ukrainian.dict
 of 1,841,900 bytes, so that's not fast. It is also
 using String.matches(regexp) to parse it, which is not fast.
 Using the Matcher class should be faster, as indicated here:
   http://www.regular-expressions.info/java.html
 Anyway, transforming the MySpell file into a binary
 dictionary should be faster, as described here:
 http://languagetool.wikidot.com/developing-a-tagger-dictionary
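
 For illustration, a minimal sketch of the precompiled-Pattern approach,
 assuming a made-up two-column line format (word, then tag); the real
 ukrainian.dict layout may differ:

   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class DictParseSketch {
     // Compile once; String.matches() recompiles the regex on every call.
     private static final Pattern LINE = Pattern.compile("(\\S+)\\s+(\\S+)");

     static void parseLine(String line) {
       Matcher m = LINE.matcher(line);
       if (m.matches()) {
         String word = m.group(1);   // surface form
         String tag = m.group(2);    // POS tag
         // ... build the tagger entry from word and tag
       }
     }
   }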

 It was presumably done that way because there was no generally
 available morphological dictionary for Ukrainian. UGTag has been
 available since last year or the year before, so a dictionary
 generated from that would probably be better.

As far as I remember, the tagger based on MySpell is not only slow but
also not used in any rules. UGTag is the way to go, and we had it on
our GSoC list. Adding it should be fairly trivial: it is already in Java,
so simply writing up an interface should be easy (see the sketch below).
But then again, we need someone who'd use the thing to write rules ;)
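
For concreteness, a sketch of such an adapter; both interfaces here are
simplified stand-ins (neither LanguageTool's actual Tagger interface nor
UGTag's actual API is reproduced, only the overall shape):

  import java.util.Arrays;
  import java.util.List;

  // Simplified stand-in for LanguageTool's tagger interface.
  interface SimpleTagger {
    List<String> tag(String token);  // token -> possible POS tags
  }

  class UGTagAdapter implements SimpleTagger {
    @Override
    public List<String> tag(String token) {
      return ugtagLookup(token);     // delegate to UGTag
    }

    // ugtagLookup() is hypothetical; substitute the real UGTag call here.
    private List<String> ugtagLookup(String token) {
      return Arrays.asList("noun:m:v_naz");  // placeholder result
    }
  }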

Regards
Marcin





Re: [Languagetool] Possible way of speeding up LanguageTool

2012-01-14 Thread Yakov Reztsov

On 14 January 2012, 16:50, Marcin Miłkowski wrote:
 On 2012-01-14 00:14, Dominique Pellé wrote:
  Dominique Pellé wrote:
 
  Marco A.G.Pinto wrote:
 
  Hello!
 
  I don't know how it was coded but I tried to run a script Daniel told me
  about and it took a lot of time.
 
  My suggestion is that when starting LibreOffice, instead of reading from
  the XMLs, the contents would already be in files containing arrays with
  the words.
 
  Let me give you an example for the English XML file:
  ARRAY:
  Possible Typos
  2
  ARE_STILL_THE_SOME
  IS_EVEN_WORST
  Grammar
  1
  WANT_THAT_I
  #END#
 
  This is a very simple example of optimization. For example, it has the
  type of grammar error, and in the next position the number of entries for
  it (converting the string to a number). It ends with #END#.
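
  For concreteness, a minimal sketch of a parser for that flat format; the
  details beyond the quoted example (category name, entry count, rule IDs,
  #END# terminator) are guesses:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class FlatRuleIndexParser {
      // Maps category name (e.g. "Possible Typos") to its rule IDs.
      static Map<String, List<String>> parse(String data) throws IOException {
        Map<String, List<String>> index =
            new LinkedHashMap<String, List<String>>();
        BufferedReader in = new BufferedReader(new StringReader(data));
        in.readLine();                                 // skip "ARRAY:" header
        String line;
        while ((line = in.readLine()) != null && !line.equals("#END#")) {
          int count = Integer.parseInt(in.readLine()); // number of entries
          List<String> ruleIds = new ArrayList<String>();
          for (int i = 0; i < count; i++) {
            ruleIds.add(in.readLine());                // e.g. ARE_STILL_THE_SOME
          }
          index.put(line, ruleIds);                    // line = category name
        }
        return index;
      }
    }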
 
 
  Hi Marco
 
  I can't say that I like it. It would be messy. XML syntax can be
  checked, for example. Anyway, I doubt that it would help speed things up.
 
  Before doing optimizations it is always necessary to measure first;
  otherwise you might try to optimize something that takes only 1%
  of the time in the first place. Measuring also lets you verify
  objectively, with numbers, that whatever you change actually helps.
 
  After reading your email, I was curious about the startup time (in
  command line, not in LibreOffice), so I created a script to measure
  it for all languages:
 
  For each language, the script:
 
  - counts the number of XML rules. I used the latest in SVN (r6239),
 so it can differ slightly from the numbers at
 http://www.languagetool.org/languages/
  - measures startup time (3 times to avoid outliers) when
 launching LanguageTool with an empty sentence.
 
  Here is the script: http://dominique.pelle.free.fr/startup-time-lt.sh
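
  A rough Java equivalent of what the script times per language; the
  activateDefaultPatternRules() call matches the 2012-era API as far as I
  recall, so treat it as an assumption:

    import org.languagetool.JLanguageTool;
    import org.languagetool.language.Ukrainian;

    public class StartupTimer {
      public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();
        JLanguageTool lt = new JLanguageTool(new Ukrainian());
        lt.activateDefaultPatternRules();  // load the XML pattern rules
        lt.check("");                      // check an empty sentence
        System.out.printf("startup: %.2f s%n",
            (System.nanoTime() - t0) / 1e9);
      }
    }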
 
  Here is the result:
 
  $ cd languagetool
  $ ./startup-time-lt.sh
 
  lang | #rules | startup time in sec (3 samples)
  -----+--------+--------------------------------
   ast |     61 |  0.29 0.26 0.27
   br  |    353 |  0.41 0.39 0.40
   ca  |    214 |  0.83 0.83 0.83
   zh  |    328 |  2.34 2.30 2.28
   da  |     22 |  0.82 0.80 0.79
   nl  |    336 |  0.88 0.88 0.89
   en  |    789 |  0.96 0.95 0.96
   eo  |    262 |  0.84 0.85 0.84
   fr  |   2015 |  0.51 0.52 0.51
   gl  |    157 |  0.96 0.97 0.97
   de  |    717 |  1.74 1.91 1.76
   is  |     39 |  0.77 0.82 0.81
   it  |     94 |  0.28 0.28 0.28
   km  |     24 |  0.88 0.84 0.83
   lt  |      6 |  0.20 0.21 0.22
   ml  |     23 |  0.81 0.78 0.81
   pl  |   1029 |  1.17 1.17 1.16
   ro  |    452 |  0.99 0.98 0.94
   ru  |    149 |  0.95 0.91 0.92
   sk  |     58 |  1.00 0.95 0.93
   sl  |     86 |  0.80 0.86 0.83
   es  |     70 |  0.91 0.85 0.84
   sv  |     26 |  0.29 0.26 0.27
   tl  |     44 |  0.25 0.26 0.25
   uk  |     12 |  1.80 1.90 1.84
 
  This was measured on a 5-year-old laptop
  (Linux x86, Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz).
 
  What's interesting here is that the startup time
  does not strongly depend on the number of XML
  rules. Ukrainian (uk) has only 12 XML rules, yet it
  is 3.52 times slower than French (fr), which has
  2015 rules!
 
  The slowest startup times are for Chinese, German
  and Ukrainian. It would be interesting to find out why.
 
  Regards
  -- Dominique
 
 
  I spent a bit of time finding out why the
  startup time is slow for Ukrainian (uk)
  (~1.80 sec) even though it has only 12 XML rules:
 
  uk | 12 |  1.80 1.90 1.84
 
  First, I disabled the SRX tokenizer by
  commenting out getSentenceTokenizer()
  in src/java/org/languagetool/language/Ukrainian.java.
  This saves about half a second, bringing
  the startup time to:
 
 uk | 12 |  1.31 1.31 1.29
 
  Disabling the SRX tokenizer in Esperanto.java
  also saved half a second at startup for Esperanto.
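
  A standalone probe of the same cost, isolating the SRX load; the
  SRXSentenceTokenizer constructor argument is an assumption about the
  2012-era API:

    import org.languagetool.tokenizers.SRXSentenceTokenizer;
    import org.languagetool.tokenizers.SentenceTokenizer;

    public class SrxStartupProbe {
      public static void main(String[] args) {
        long t0 = System.nanoTime();
        // Parsing segment.srx happens here; this is the ~0.5 sec saved
        // above by commenting out getSentenceTokenizer().
        SentenceTokenizer tokenizer = new SRXSentenceTokenizer("uk");
        tokenizer.tokenize("Перше речення. Друге речення.");
        System.out.printf("SRX setup + first call: %.2f s%n",
            (System.nanoTime() - t0) / 1e9);
      }
    }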
 
  I find it odd that the SRX file src/resource/segment.srx
  contains the rules for *all* languages. Wouldn't
  it make more sense to have a smaller SRX file
  per language? The function that loads it
  (SRXSentenceTokenizer in
  src/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java)
  knows the language.
 
 But it would go against the SRX standard, which uses a hierarchy of
 processing. Note that for all languages we have a few common rules
 (paragraph splitting etc.), and others could fall back on English etc. So
 splitting is not actually the problem; it is a feature of SRX to have
 these things together.
 
 Anyway, the SRX tokenizer uses regular expressions, and if they are
 monsters (too many disjunctions), it can be slow. Optimizing the regexps
 in the SRX file will probably result in more savings than splitting it up
 and reading the parts separately. Note that for the other SRX languages,
 we don't have too much of an overhead.
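
 A self-contained illustration of why such monster disjunctions hurt: the
 regex engine tries the alternatives branch by branch, while a set lookup
 is constant time (the abbreviation list here is synthetic):

   import java.util.HashSet;
   import java.util.Set;
   import java.util.regex.Pattern;

   public class DisjunctionCostDemo {
     public static void main(String[] args) {
       StringBuilder alternation = new StringBuilder();
       Set<String> abbreviations = new HashSet<String>();
       for (int i = 0; i < 5000; i++) {
         String abbr = "abbr" + i;
         abbreviations.add(abbr);
         if (i > 0) {
           alternation.append('|');
         }
         alternation.append(abbr);
       }
       // One flat disjunction, like an over-long abbreviation rule in SRX.
       Pattern monster = Pattern.compile("(?:" + alternation + ")\\.");

       String candidate = "abbr4999.";  // worst case: last alternative
       long t0 = System.nanoTime();
       for (int i = 0; i < 10000; i++) {
         monster.matcher(candidate).matches();
       }
       System.out.printf("regex: %d ms%n", (System.nanoTime() - t0) / 1000000);

       t0 = System.nanoTime();
       for (int i = 0; i < 10000; i++) {
         abbreviations.contains(candidate.substring(0, candidate.length() - 1));
       }
       System.out.printf("set:   %d ms%n", (System.nanoTime() - t0) / 1000000);
     }
   }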

I'll try to rewrite the Ukrainian section in the SRX file to remove some regular expressions.

