On 2014-04-29 07:02, Dominique Pellé wrote:
> Daniel Naber <daniel.na...@languagetool.org> wrote:
>
> > On 2014-04-27 22:18, Dominique Pellé wrote:
> >
> > > <token regexp="yes" postag_group1="foo">ez-(.*)</token>
> >
> > I'm not sure how this could be implemented in a clean way... wouldn't
> > this be a rather ugly special case in the tagger to ignore the
> > tokenization and also split at the hyphen?
>
> I'm not sure either how it would be implemented, as I don't know
> that code well enough. I have not tried to implement it. But I don't
> think there should be a special case for the hyphen. My example
> contains a hyphen, but hyphens should not be special.
> The POS tag should rather be probed on
> pattern.matcher.group(1) of the regexp.
>
> It's not an ugly case. It's a useful general-purpose feature, which
> can avoid writing Java rules. Writing a Java rule is uglier.
>
> Another example where I could use it is French conjugated
> verbs in interrogations such as "Peux-tu" (= can you),
> "Peut-il" (= can he)... where the verb and the pronoun are in
> the same token (again with a hyphen in this example).
>
> Right now, erroneous French conjugations such as *Peut-tu* are
> not detected as an error by LanguageTool (false negative).
> I could detect it as an error if I could do something more or less
> like this:
>
>   <pattern>
>     <token regexp="yes" postag_group1="V.*"
>            postag_group1_regexp="yes">(.*)-tu
>       <exception regexp="yes" postag_group1="V.* 2 .*"
>                  postag_group1_regexp="yes">(.*)-tu</exception>
>     </token>
>   </pattern>
>
> This would check that whatever matches (.*) in the token is a
> verb, and the exception would skip it when it is conjugated in the
> 2nd person singular (i.e. "V.* 2 .*").
>
> The French grammar checker "Grammalecte", based on Lightproof,
> correctly detects *Peut-tu* as an error. Neither Grammalecte nor
> Lightproof tokenizes, so they are quite different from LanguageTool.
> Glancing at the Grammalecte rules (Grammalecte-0.3.9.1/fr-rules.txt),
> it detects the error using a rule like this:
>
>   (\w+)-tu <- option("inte") and not morph(\1, "po:(.pre|.imp|ipsi|ifut|cond).* po:2sg", False) and spell(\1) and not re.match("(?i)vite$", \1)
>       -1> _   # Forme interrogative. « \1 » n’est pas un verbe à la deuxième personne du singulier.

Well, why should we invent a new piece of XML machinery when we
already have something similar with the <match> element? Basically,
you want to search and replace in the token's surface form, and then
tag the result. I think we could simply adapt the syntax we already
use for the synthesizer:

  <token postag="V.*" postag_regexp="yes"><match regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>[whatever you want here]</token>

This would simply apply the regexp replace and run the tagger on the
result.

Note that this syntax is almost valid right now, and LT won't
complain about it; only weird things will happen, because it doesn't
have any consistent semantics yet. I say "almost" because the @no
attribute is required, so you currently need to write:

  <token postag="V.*" postag_regexp="yes"><match no="3" regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>[whatever you want here]</token>

So the functionality is almost there, and it would be fairly easy to
add it via the reference setting in our code.

All in all, there are several more uses of <match> that are not yet
fully supported, and the code is a bit ugly, I must say.

Regards,
Marcin
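
To illustrate the "apply the regexp replace, then run the tagger on
the result" idea described above, here is a minimal Java sketch. It
does not use LanguageTool's code at all: the class name, the
TOY_TAGS map (a stand-in for the real tagger) and the method
isWrongInterrogative are all hypothetical, and the tags are only
modelled on the "V.* 2 .*" pattern from the quoted example. It shows
the intended semantics, not how <match> is or would be implemented.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpReplaceThenTag {

    // Hypothetical stand-in for the tagger: surface form -> POS tags.
    // A real implementation would ask the language's tagger instead.
    private static final Map<String, List<String>> TOY_TAGS = Map.of(
        "peux", List.of("V ind pres 1 s", "V ind pres 2 s"),
        "veux", List.of("V ind pres 1 s", "V ind pres 2 s"),
        "peut", List.of("V ind pres 3 s"));

    // Mirrors regexp_match="(.*)-tu" with regexp_replace="$1".
    private static final Pattern SURFACE =
        Pattern.compile("(.*)-tu", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_VERB = Pattern.compile("V.*");
    private static final Pattern SECOND_PERSON = Pattern.compile("V.* 2 .*");

    static List<String> tag(String word) {
        return TOY_TAGS.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    // True if the token looks like "<verb>-tu" but the verb part has
    // no 2nd person reading (e.g. "Peut-tu").
    static boolean isWrongInterrogative(String token) {
        Matcher m = SURFACE.matcher(token);
        if (!m.matches()) {
            return false;
        }
        // Step 1: rewrite the surface form ("Peut-tu" becomes "Peut").
        String rewritten = m.replaceFirst("$1");
        // Step 2: tag the rewritten form, not the original token.
        List<String> tags = tag(rewritten);
        boolean isVerb = tags.stream()
            .anyMatch(t -> ANY_VERB.matcher(t).matches());
        boolean isSecondPerson = tags.stream()
            .anyMatch(t -> SECOND_PERSON.matcher(t).matches());
        return isVerb && !isSecondPerson;
    }

    public static void main(String[] args) {
        for (String token : List.of("Peux-tu", "Peut-tu", "Vite-tu")) {
            System.out.println(token + " -> " + isWrongInterrogative(token));
        }
    }
}
```

With this toy dictionary, only "Peut-tu" is flagged: "Peux-tu" has a
2nd person reading, and "Vite-tu" is left alone because "vite" is
not tagged as a verb here.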