Re: Suggestion: find POS tag of portion of a word in XML rules

Marcin Miłkowski Wed, 10 Sep 2014 06:19:26 -0700

On 2014-09-10 11:34, Dominique Pellé wrote:
> Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
>     On 2014-09-09 23:10, Dominique Pellé wrote:
>      > Daniel Naber <daniel.na...@languagetool.org> wrote:
>      >
>      >     On 2014-09-09 22:38, Dominique Pellé wrote:
>      >
>      >     > * why does your example give a message in
>      >     >   the java rule.  Why can't we use <message…></message>
>      >     >   instead?
>      >
>      >     You're right, my example was misleading. <message> can be used.
>      >
>      >     > * you wrote that args="no:1" refers to the token.
>      >     >   What about if we need to use this for one of the
>      >     >    <exception>...</exception> inside a token?
>      >
>      >     We could introduce more attributes like maybe 'regexp_negate'.
>      >
>      >     > In other words, the rule matches token "(.*)-tu"  where
>      >     > the POS of portion in parentheses has to be a verb (V.*).
>      >     > But there is an exception if the POS of the portion in parentheses
>      >     > matches "V.* 2 .*". So that rule would correctly:
>      >
>      >     Couldn't that also be expressed with "V.* [13] .*"?
>      >
>      >
>      >
>      > No, that would miss at least infinitive verbs "V inf"
>      > (e.g. chanter) participles  "V ppa m s"  (chanté)
>      > and "V ppr" (chantant).
>      >
>      > We could of course come up with a regexp that
>      > matches all the possible verbs POS  except those
>      > "V.* 2 .*" to avoid an exception, but:
>      >
>      > * that regexp might be rather long as there are
>      >    many kinds of verb POS tags. Using an exception is
>      >    thus more natural.
>      > * and more generally speaking, being able to
>      >    match POS of portion of token in exception
>      >    can be useful in some other cases anyway too.
>
>     Let me understand your problem:
>
>     * you want to match all verbs (V.*) that have "-tu" at the end (this is
>     <token postag='V.*' postag_regexp="yes" regexp="yes">.*-tu</token>)
>     * but not the ones that are in the second person: V.* 2 .* So why
>     not simply use the existing <exception postag="V.* 2 .*"
>     postag_regexp="yes"/>? It will be a little slow because of the regular
>     expressions, but it does everything you need, right? Or am I missing
>     something?
>
>
>
> Hi Marcin
>
> Not exactly.
>
> I want to find errors in things like "Peut-tu" and "Peux-il", which
> are both incorrect in French. The correct forms are "Peux-tu" (= Can you...)
> and "Peut-il" (= Can he...).
>
> "Peut-tu" token does not have a POS tag (so what you wrote above
> does not work).  It's an invalid word. Interestingly, it's not even marked
> as invalid by the spelling checker, because Hunspell splits it with
> the dash, and "Peut" as well as "tu" are both valid words.
>
> To detect "Peut-tu" as a mistake, a grammar rule could check that
> the POS tag of the portion "Peut" is "V ind pres 3 s" and since -tu
> expects a verb before the dash with POS like "V .* 2 .*",
> there is a mistake.
>
> For the correct "Peux-tu", the portion "Peux" has POS tag "V ind pres 2 s"
> and since -tu expects a verb before the dash with POS like "V .* 2 .*",
> no error would be given.
>
> However, I currently don't have a mechanism to find the POS tag of a
> portion of a token such as "Peut-tu", so I can't write such a rule.
>
> I hope that's clearer now.
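
To make the check described above concrete, here is a minimal Python sketch. It is only an illustration of the logic, not LanguageTool code: the lexicon, the function name and the regular expressions are hypothetical stand-ins for the real tagger and for the proposed XML attributes.

```python
import re

# Hypothetical toy lexicon: word form -> POS tags.
# A real implementation would ask the French tagger instead.
TOY_LEXICON = {
    "peut": ["V ind pres 3 s"],
    "peux": ["V ind pres 1 s", "V ind pres 2 s"],
    "chanter": ["V inf"],
}

TOKEN_RE = re.compile(r"(.*)-tu", re.IGNORECASE)  # the token pattern "(.*)-tu"
VERB_RE = re.compile(r"V.*")                      # the portion must be a verb
SECOND_PERSON_RE = re.compile(r"V.* 2 .*")        # exception: second person


def is_agreement_error(token):
    """Return True if token is '<verb>-tu' and the verb is not second person."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        return False
    tags = TOY_LEXICON.get(match.group(1).lower(), [])
    if not any(VERB_RE.fullmatch(tag) for tag in tags):
        return False  # the portion before the dash is not a known verb
    # The exception: one second-person reading is enough to accept the token.
    return not any(SECOND_PERSON_RE.fullmatch(tag) for tag in tags)
```

With this toy lexicon, "Peut-tu" is flagged and "Peux-tu" is not; an infinitive such as "chanter-tu" is flagged as well, which is the case a "V.* [13] .*" pattern would miss.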


Yes, I understand now. But wouldn't it be simpler and more general to
change tokenization slightly? Even "voulez-vous" is not analyzed
correctly, and it's fairly trivial to add special cases so that the
tokenizer splits words containing one of a handful of personal pronouns,
as long as the first part is a verb. I do this kind of intra-word
tokenization for complex adjectives in Polish, and it's helpful for many
uses, not just for the rule you mentioned.
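
A rough Python sketch of that special case, just to show the shape of it. This is not the actual LanguageTool tokenizer: the pronoun list, the toy verb set and the choice to emit the dash as its own token are all assumptions made for the example.

```python
# Hypothetical data: a real tokenizer would consult the tagger dictionary
# to decide whether the first part is a verb.
PRONOUNS = {"je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles"}
TOY_VERB_FORMS = {"peut", "peux", "voulez"}


def split_verb_pronoun(token):
    """Split 'verb-pronoun' into separate tokens; leave anything else alone."""
    head, dash, tail = token.rpartition("-")
    if dash and head.lower() in TOY_VERB_FORMS and tail.lower() in PRONOUNS:
        # Assumption: the dash becomes a token of its own.
        return [head, dash, tail]
    return [token]
```

Once "Peut-tu" comes out as "Peut", "-", "tu", each part gets its own POS tags and an ordinary rule with <token> and <exception> elements can check the agreement. Other hyphenated words such as "arc-en-ciel" are left untouched because their first part is not a verb.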

Regards,
Marcin


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
