Re: Suggestion: find POS tag of portion of a word in XML rules

Dominique Pellé Wed, 10 Sep 2014 12:44:00 -0700

Marcin Miłkowski <list-addr...@wp.pl> wrote:

On 2014-09-10 11:34, Dominique Pellé wrote:
> > Marcin Miłkowski <list-addr...@wp.pl> wrote:
> >
> >     On 2014-09-09 23:10, Dominique Pellé wrote:
> >      > Daniel Naber <daniel.na...@languagetool.org> wrote:
> >      >
> >      >     On 2014-09-09 22:38, Dominique Pellé wrote:
> >      >
> >      >     > * Why does your example give a message in
> >      >     >   the Java rule? Why can't we use <message…></message>
> >      >     >   instead?
> >      >
> >      >     You're right, my example was misleading. <message> can be
> >      >     used.
> >      >
> >      >     > * You wrote that args="no:1" refers to the token.
> >      >     >   What if we need to use this for one of the
> >      >     >   <exception>...</exception> inside a token?
> >      >
> >      >     We could introduce more attributes like maybe 'regexp_negate'.
> >      >
> >      >     > In other words, the rule matches token "(.*)-tu"  where
> >      >     > the POS of the portion in parentheses has to be a verb (V.*).
> >      >     > But there is an exception if the POS of the portion in
> >      >     > parentheses matches "V.* 2 .*". So that rule would correctly:
> >      >
> >      >     Couldn't that also be expressed with "V.* [13] .*"?
> >      >
> >      >
> >      >
> >      > No, that would miss at least infinitive verbs "V inf"
> >      > (e.g. chanter), participles "V ppa m s" (chanté)
> >      > and "V ppr" (chantant).
> >      >
> >      > We could of course come up with a regexp that
> >      > matches all the possible verb POS tags except those
> >      > matching "V.* 2 .*" to avoid an exception, but:
> >      >
> >      > * that regexp might be rather long as there are
> >      >    many kinds of verb POS tags. Using an exception is
> >      >    thus more natural.
> >      > * and more generally speaking, being able to
> >      >    match the POS of a portion of a token in an exception
> >      >    can be useful in some other cases too.
> >
> >     Let me understand your problem:
> >
> >     * you want to match all verbs (V.*) that have "-tu" at the end (this is
> >     <token postag='V.*' postag_regexp="yes" regexp="yes">.*-tu</token>)
> >     * but not the ones that have a verb in the second person: V.* 2 .* So why
> >     not simply use the old <exception postag="V.* 2 .*"
> >     postag_regexp="yes"/>? It will be a little bit slow due to regular
> >     expressions but does everything you need, right? Or am I missing
> >     something?
> >
> >
> >
> > Hi Marcin
> >
> > Not exactly.
> >
> > I want to find errors in things like "Peut-tu" and "Peux-il", which
> > are both incorrect in French. The correct forms are "Peux-tu" (= Can you...)
> > and "Peut-il" (= Can he...).
> >
> > The "Peut-tu" token does not have a POS tag (so what you wrote above
> > does not work).  It's an invalid word. Interestingly, it's not even marked
> > as invalid by the spelling checker, because Hunspell splits it at
> > the dash, and "Peut" as well as "tu" are both valid words.
> >
> > To detect "Peut-tu" as a mistake, a grammar rule could check that
> > the POS tag of the portion "Peut" is "V ind pres 3 s" and since -tu
> > expects a verb before the dash with POS like "V .* 2 .*",
> > there is a mistake.
> >
> > For the correct "Peux-tu", the portion "Peux" has POS tag "V ind pres 2 s"
> > and since -tu expects a verb before the dash with POS like "V .* 2 .*"
> > no error would be given.
> >
> > However, I currently don't have a mechanism to find the POS tag of a
> > portion of a token such as "Peut-tu", so I can't write such a rule.
> >
> > I hope that's clearer now.
>
> Yes, I understand now. But isn't it simpler and more universal to make
> tokenization a little bit different? Even "voulez-vous" is not analyzed
> correctly, and it's pretty trivial to add special cases for the
> tokenizer to split words containing a bunch of personal tokens, as long
> as the first word is a verb. I do such intra-word tokenization for
> Polish for complex adjectives, and it's helpful for many uses, not just
> for the rule you mentioned.
>
> Regards,
> Marcin
>



Yes, I'll have a look at changing the tokenizer for French
to split things like "Peux-tu", as also suggested by Jaume.
I just glanced at the Catalan and Polish tokenizers, which do
that kind of thing. I don't want to split all words with hyphens,
as this would make it harder to write rules in most cases.
A word like "face-à-face" is just one noun and should
be treated as one token, but I'll make special cases
in the tokenizer to split things matching:
   .*-{je,tu,ils?,elles?,on,nous,vous}
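
To make the special case concrete, here is a rough sketch in Python
(for brevity only; LanguageTool itself is written in Java). The function
name and the choice to keep the hyphen with the pronoun are my assumptions
for illustration, not the API of any existing tokenizer. The pattern above
is written as a regular expression with alternation:

```python
import re

# Hypothetical helper, not LanguageTool code: split a hyphenated token
# only when it ends in one of the subject pronouns listed above.
PRONOUN_SUFFIX = re.compile(
    r"^(.+)-(je|tu|ils?|elles?|on|nous|vous)$", re.IGNORECASE
)


def split_verb_pronoun(token):
    """Return [stem, "-pronoun"] for a match, else the token unchanged."""
    match = PRONOUN_SUFFIX.match(token)
    if match is None:
        return [token]  # e.g. "face-à-face" stays a single token
    # Assumption: the hyphen is kept with the pronoun part.
    return [match.group(1), "-" + match.group(2)]


print(split_verb_pronoun("Peux-tu"))
print(split_verb_pronoun("face-à-face"))
```

Note that this sketch does not check that the first part is a verb, so
a noun that happens to end in one of these pronouns would be split too.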

There may still be use cases for getting the POS tag of a
portion of a token, but hopefully tweaking the tokenization
will suffice for me for now.
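
For reference, the check on a portion of a token discussed in this thread
could look roughly like the sketch below (Python for brevity). Everything
in it is an assumption for illustration: the tiny lexicon stands in for a
real tagger and only contains the two tags quoted in this thread, and the
function name is made up.

```python
import re

# Toy lexicon standing in for a real POS tagger (assumption): only the
# tags quoted in this thread are listed.
TOY_TAGS = {
    "peux": ["V ind pres 2 s"],
    "peut": ["V ind pres 3 s"],
}

# POS pattern the verb before the dash must match, per pronoun.
# Only "tu" is filled in, since that is the pattern given in the thread.
EXPECTED = {"tu": re.compile(r"V .* 2 .*")}


def agreement_error(token):
    """True if the part before the last dash is a known verb form whose
    tags all fail the pattern expected by the pronoun after the dash."""
    stem, dash, pronoun = token.lower().rpartition("-")
    expected = EXPECTED.get(pronoun)
    tags = TOY_TAGS.get(stem, [])
    if not dash or expected is None or not tags:
        return False  # nothing to check for this token
    return not any(expected.fullmatch(tag) for tag in tags)


print(agreement_error("Peut-tu"))
print(agreement_error("Peux-tu"))
```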

Regards
Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
