On 2012-01-14 01:39, Jimmy O'Regan wrote:
> 2012/1/13 Dominique Pellé <dominique.pe...@gmail.com>:
>> I see that the Ukrainian tokenizer uses a MySpell-based
>> tagger (UkrainianMyspellTagger class), which reads and
>> parses a text file, dist/resource/uk/ukrainian.dict,
>> of 1,841,900 bytes, so that's not fast. It also uses
>> String.matches("regexp") to parse it, which is not fast either.
>> Using the Matcher class should be faster, as indicated here:
>>   http://www.regular-expressions.info/java.html
>> Anyway, transforming the MySpell file into a binary
>> dictionary should be faster as described here:
>> http://languagetool.wikidot.com/developing-a-tagger-dictionary
>
> It was presumably done that way because there was no generally
> available morphological dictionary for Ukrainian. UGTag has been
> available since last year or the year before, so a dictionary
> generated from that would probably be better.

As far as I remember, the MySpell-based tagger is not only slow but is
also not used in any rules. UGTag is the way to go, and we had it on
our GSoC list. Adding it should be fairly trivial: it is already in Java,
so simply writing an interface for it should be easy. But then again, we
need someone who'd use the thing to write rules ;)
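
For reference, the regex point quoted above can be sketched roughly as follows. This is only an illustration, not the actual UkrainianMyspellTagger code: the class name DictLineParser, the method parseWords and the "word/FLAGS" line format are assumptions made for the example. The idea is that String.matches() recompiles the pattern on every call, while a Pattern compiled once and a reused Matcher avoid that cost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictLineParser {
    // Hypothetical "word/FLAGS" line format; the real ukrainian.dict layout may differ.
    // Compiled once, instead of once per line as with String.matches().
    private static final Pattern ENTRY = Pattern.compile("^([^/\\s]+)/(\\w+)$");

    /** Returns the word part of every line that matches ENTRY. */
    static List<String> parseWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        Matcher m = ENTRY.matcher("");   // one Matcher, reused for all lines
        for (String line : lines) {
            m.reset(line);               // point the Matcher at the next line
            if (m.matches()) {
                words.add(m.group(1));   // group 1 is the part before the slash
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("слово/ab", "# comment", "мова/cd");
        System.out.println(parseWords(sample));
    }
}
```

Lines that do not fit the assumed format are simply skipped here. Converting the file to a binary dictionary, as suggested in the quoted mail, would remove the parsing step altogether.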

Regards
Marcin



_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
