On 2014-09-10 11:34, Dominique Pellé wrote:
> Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
> > On 2014-09-09 23:10, Dominique Pellé wrote:
> > > Daniel Naber <daniel.na...@languagetool.org> wrote:
> > >
> > > > On 2014-09-09 22:38, Dominique Pellé wrote:
> > > >
> > > > > * why does your example give a message in
> > > > >   the Java rule. Why can't we use <message…></message>
> > > > >   instead?
> > > >
> > > > You're right, my example was misleading. <message> can be used.
> > > >
> > > > > * you wrote that args="no:1" refers to the token.
> > > > >   What about if we need to use this for one of the
> > > > >   <exception>...</exception> inside a token?
> > > >
> > > > We could introduce more attributes like maybe 'regexp_negate'.
> > > >
> > > > > In other words, the rule matches token "(.*)-tu" where
> > > > > the POS of the portion in parentheses has to be a verb (V.*).
> > > > > But there is an exception if the POS of the portion in parentheses
> > > > > matches "V.* 2 .*". So that rule would correctly:
> > > >
> > > > Couldn't that also be expressed with "V.* [13] .*"?
> > >
> > > No, that would miss at least infinitive verbs "V inf"
> > > (e.g. chanter), participles "V ppa m s" (chanté)
> > > and "V ppr" (chantant).
> > >
> > > We could of course come up with a regexp that
> > > matches all the possible verb POS tags except those
> > > matching "V.* 2 .*" to avoid an exception, but:
> > >
> > > * that regexp might be rather long as there are
> > >   many kinds of verb POS tags. Using an exception is
> > >   thus more natural.
> > > * and more generally speaking, being able to
> > >   match the POS of a portion of a token in an exception
> > >   can be useful in some other cases anyway too.
> >
> > Let me understand your problem:
> >
> > * you want to match all verbs (V.*) that have "-tu" at the end (this is
> >   <token postag="V.*" postag_regexp="yes" regexp="yes">.*-tu</token>)
> > * but not the ones that have a verb in the second person: V.* 2 .*
> >
> > So why not simply use the old <exception postag="V.* 2 .*"
> > postag_regexp="yes"/>? It will be a little bit slow due to regular
> > expressions but does everything you need, right? Or am I missing
> > something?
>
> Hi Marcin
>
> Not exactly.
>
> I want to find errors in things like "Peut-tu" and "Peux-il", which
> are both incorrect in French. Correct would be "Peux-tu" (= Can you...)
> and "Peut-il" (= Can he...).
>
> The "Peut-tu" token does not have a POS tag (so what you wrote above
> does not work). It's an invalid word. Interestingly, it's not even marked
> as invalid by the spelling checker, because Hunspell splits it at
> the dash, and "Peut" as well as "tu" are both valid words.
>
> To detect "Peut-tu" as a mistake, a grammar rule could check that
> the POS tag of the portion "Peut" is "V ind pres 3 s", and since -tu
> expects a verb before the dash with a POS like "V .* 2 .*",
> there is a mistake.
>
> For the correct "Peux-tu", the portion "Peux" has POS tag "V ind pres 2 s",
> and since -tu expects a verb before the dash with a POS like "V .* 2 .*",
> no error would be given.
>
> However, I currently don't have a mechanism to find the POS tag of a
> portion of a token such as "Peut-tu", so I can't write such a rule.
>
> I hope that's clearer now.

Yes, I understand now. But isn't it simpler and more universal to make
tokenization a little bit different? Even "voulez-vous" is not analyzed
correctly, and it's pretty trivial to add special cases to the tokenizer
so that it splits words containing one of a small set of personal pronoun
tokens, as long as the first word is a verb.

I do such intra-word tokenization for Polish for complex adjectives, and
it's helpful for many uses, not just for the rule you mentioned.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
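
For illustration only, here is a minimal Python sketch of the tokenizer idea discussed in this thread: split a hyphenated token when its first part is a known verb and its second part is a personal pronoun, then check that the verb has a POS tag compatible with the pronoun. This is not LanguageTool code. The lexicon, the pronoun table and the function names (VERB_TAGS, PRONOUN_EXPECTS, split_verb_pronoun, agreement_error) are all hypothetical, and the tags only imitate the "V ind pres 2 s" style quoted above.

```python
import re

# Toy lexicon (hypothetical): lower-cased verb form -> POS tags.
VERB_TAGS = {
    "peux": ["V ind pres 1 s", "V ind pres 2 s"],
    "peut": ["V ind pres 3 s"],
    "voulez": ["V ind pres 2 p"],
}

# Hypothetical table: pronoun -> regexp that the verb's POS tag must match.
PRONOUN_EXPECTS = {
    "tu": r"V .* 2 s",
    "il": r"V .* 3 s",
    "vous": r"V .* 2 p",
}


def split_verb_pronoun(token):
    """Split 'verb-pronoun' into [verb, '-pronoun']; otherwise return [token]."""
    head, sep, tail = token.rpartition("-")
    if sep and head.lower() in VERB_TAGS and tail.lower() in PRONOUN_EXPECTS:
        return [head, "-" + tail]
    return [token]


def agreement_error(token):
    """Return True if no POS tag of the verb part is compatible with the pronoun."""
    parts = split_verb_pronoun(token)
    if len(parts) != 2:
        return False  # not a verb-pronoun compound, nothing to check
    verb = parts[0].lower()
    pronoun = parts[1][1:].lower()  # drop the leading dash
    expected = re.compile(PRONOUN_EXPECTS[pronoun])
    return not any(expected.fullmatch(tag) for tag in VERB_TAGS[verb])
```

With this toy data, agreement_error("Peut-tu") and agreement_error("Peux-il") report an error, while "Peux-tu", "Peut-il" and "voulez-vous" pass. Once the token is split, each part can carry its own POS tags, so an ordinary two-token rule could express the same check without any special per-portion matching.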