On 2015-09-07 10:55, Dominique Pellé wrote:
> If <regexp> cannot be combined with <token> then we lose the ability
> to have inflected="yes", postag="...", etc.
>
> But I wonder how expensive this is. I assume that tokenization makes
> matching faster than applying regexp to whole sentences. If it slows
> down LT a lot (to be confirmed), then we should not abuse the
> <regexp>...</regexp> feature.
I'm more concerned about how complex the code becomes if we implement
it this way. Implementing <regexp> the simple way already has huge
potential for simplification. Here are the percentages of "simple"
rules, i.e. rules that only use 'regexp', 'case_sensitive', and
'skip=-1' in all of their tokens, which translate easily to regex
(it's a quick hack, but the numbers should be roughly correct):

60% for Asturian
58% for Belarusian
44% for Breton
28% for Catalan
52% for Chinese
64% for Danish
58% for Dutch
56% for English
24% for Esperanto
64% for French
63% for Galician
55% for German
46% for Greek
72% for Icelandic
41% for Italian
91% for Japanese
38% for Khmer
0% for Lithuanian
68% for Malayalam
97% for Persian
35% for Polish
88% for Portuguese
29% for Romanian
38% for Russian
50% for Slovak
32% for Slovenian
19% for Spanish
55% for Swedish
38% for Tagalog
42% for Tamil
27% for Ukrainian

(source: SimpleRuleCounter.java in languagetool-dev)

Regards
Daniel
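P.S. To make "translates easily to regex" concrete, here is a rough
sketch of how a simple token sequence could be glued into one
sentence-level pattern. The names (SimpleRuleSketch, SimpleToken,
compile) are made up for this mail and this is not the actual code;
it also ignores skip=-1 and assumes tokens are separated by plain
whitespace:

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not LanguageTool's actual implementation.
public class SimpleRuleSketch {

  // One <token regexp="yes">...</token> element of a "simple" rule.
  record SimpleToken(String regex, boolean caseSensitive) {}

  // Join the per-token regexes into one pattern for the whole
  // sentence: tokens are separated by whitespace in plain text,
  // so we glue them with \s+ and anchor on word boundaries.
  static Pattern compile(List<SimpleToken> tokens) {
    StringBuilder sb = new StringBuilder("\\b");
    for (int i = 0; i < tokens.size(); i++) {
      SimpleToken t = tokens.get(i);
      if (i > 0) {
        sb.append("\\s+");
      }
      // case_sensitive="no" becomes an inline (?i:...) group,
      // so the flag can differ from token to token.
      sb.append(t.caseSensitive() ? "(?:" : "(?i:")
        .append(t.regex())
        .append(')');
    }
    sb.append("\\b");
    return Pattern.compile(sb.toString());
  }

  public static void main(String[] args) {
    // Roughly: <token regexp="yes">an</token>
    //          <token regexp="yes">[bcd]\w+</token>
    Pattern p = compile(List.of(
        new SimpleToken("an", false),
        new SimpleToken("[bcd]\\w+", true)));
    Matcher m = p.matcher("This is an banana, clearly.");
    while (m.find()) {
      System.out.println("match: " + m.group());  // "an banana"
    }
  }
}

Matching that pattern against the plain sentence text should give the
same hits as the tokenized rule, at least for these simple cases,
which is why the counter above only considers rules without
inflection or POS constraints.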