On 2015-09-07 10:55, Dominique Pellé wrote:

> If <regexp> cannot be combined with <token> then we lose the ability to 
> have
> inflected="yes", postag="...", etc.
> 
> But I wonder how expensive this is.  I assume that tokenization makes 
> matching
> faster than applying regexp to whole sentences. If it slows down LT a 
> lot (to be
> confirmed), then we should not abuse the <regexp>... </regexp> feature.

I'm more concerned about how complex the code becomes if we implement it 
this way. Implementing regex support the simple way already has huge 
potential for simplification. Here are the percentages of "simple" rules 
per language, i.e. rules that use only 'regexp', 'case_sensitive' and 
'skip=-1' in all of their tokens and thus translate easily to a regex 
(the counting is a quick hack, but the numbers should be roughly 
correct):

60% for Asturian
58% for Belarusian
44% for Breton
28% for Catalan
52% for Chinese
64% for Danish
58% for Dutch
56% for English
24% for Esperanto
64% for French
63% for Galician
55% for German
46% for Greek
72% for Icelandic
41% for Italian
91% for Japanese
38% for Khmer
0% for Lithuanian
68% for Malayalam
97% for Persian
35% for Polish
88% for Portuguese
29% for Romanian
38% for Russian
50% for Slovak
32% for Slovenian
19% for Spanish
55% for Swedish
38% for Tagalog
42% for Tamil
27% for Ukrainian
(source: SimpleRuleCounter.java in languagetool-dev)
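For illustration, the "translates easily to regex" step could look roughly like the sketch below: per-token regexes from a simple rule are joined into one sentence-level pattern, with tokens separated by whitespace. The class and method names are hypothetical, not the actual LanguageTool or SimpleRuleCounter API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: collapsing a "simple" rule (tokens using only
// 'regexp', 'case_sensitive' and 'skip=-1') into a single regex.
public class SimpleRuleToRegex {

    // Wrap each token regex in a non-capturing group and join the
    // tokens with a whitespace separator.
    static String toSentenceRegex(List<String> tokenRegexes) {
        return tokenRegexes.stream()
                .map(t -> "(?:" + t + ")")
                .collect(Collectors.joining("\\s+"));
    }

    public static void main(String[] args) {
        // e.g. a two-token rule matching "a" or "an" followed by "hour"
        String re = toSentenceRegex(List.of("an?", "hour"));
        System.out.println(re);                    // (?:an?)\s+(?:hour)
        System.out.println("an hour".matches(re)); // true
    }
}
```

Whether this is faster than token-by-token matching would of course still need to be measured, as Dominique notes above.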

Regards
  Daniel


------------------------------------------------------------------------------
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel