Marcin Miłkowski <list-addr...@wp.pl> wrote:
On 2014-09-10 11:34, Dominique Pellé wrote:
> > Marcin Miłkowski <list-addr...@wp.pl> wrote:
> >
> > On 2014-09-09 23:10, Dominique Pellé wrote:
> > > Daniel Naber <daniel.na...@languagetool.org> wrote:
> > >
> > > On 2014-09-09 22:38, Dominique Pellé wrote:
> > >
> > > > * why does your example give a message in
> > > > the Java rule? Why can't we use <message…></message>
> > > > instead?
> > >
> > > You're right, my example was misleading. <message> can be used.
> > >
> > > > * you wrote that args="no:1" refers to the token.
> > > > What about if we need to use this for one of the
> > > > <exception>...</exception> inside a token?
> > >
> > > We could introduce more attributes like maybe 'regexp_negate'.
> > >
> > > > In other words, the rule matches the token "(.*)-tu" where
> > > > the POS of the portion in parentheses has to be a verb (V.*).
> > > > But there is an exception if the POS of the portion in parentheses
> > > > matches "V.* 2 .*". So that rule would correctly:
> > >
> > > Couldn't that also be expressed with "V.* [13] .*"?
> > >
> > >
> > >
> > > No, that would miss at least infinitive verbs "V inf"
> > > (e.g. chanter), and participles "V ppa m s" (chanté)
> > > and "V ppr" (chantant).
> > >
> > > We could of course come up with a regexp that
> > > matches all the possible verb POS tags except
> > > "V.* 2 .*" to avoid an exception, but:
> > >
> > > * that regexp might be rather long, as there are
> > > many kinds of verb POS tags. Using an exception is
> > > thus more natural.
> > > * more generally speaking, being able to
> > > match the POS of a portion of a token in an exception
> > > can be useful in other cases too.
> >
> > Let me understand your problem:
> >
> > * you want to match all verbs (V.*) that have "-tu" at the end (this is
> > <token postag="V.*" postag_regexp="yes" regexp="yes">.*-tu</token>)
> > * but not the ones that have the verb in the second person: V.* 2 .* So why
> > not simply use the old <exception postag="V.* 2 .*"
> > postag_regexp="yes"/>? It will be a little bit slow due to regular
> > expressions but does everything you need, right? Or am I missing
> > something?
> >
> >
> >
> > Hi Marcin
> >
> > Not exactly.
> >
> > I want to find errors in things like "Peut-tu" and "Peux-il", which
> > are both incorrect in French. The correct forms are "Peux-tu" (= Can you...)
> > and "Peut-il" (= Can he...).
> >
> > The "Peut-tu" token does not have a POS tag (so what you wrote above
> > does not work). It's an invalid word. Interestingly, it's not even marked
> > as invalid by the spelling checker, because Hunspell splits it at
> > the dash, and "Peut" as well as "tu" are both valid words.
> >
> > To detect "Peut-tu" as a mistake, a grammar rule could check that
> > the POS tag of the portion "Peut" is "V ind pres 3 s" and since -tu
> > expects a verb before the dash with POS like "V .* 2 .*",
> > there is a mistake.
> >
> > For the correct "Peux-tu", the portion "Peux" has POS tag "V ind pres 2 s"
> > and since -tu expects a verb before the dash with POS like "V .* 2 .*",
> > no error would be given.
> >
> > However, I currently don't have a mechanism to find the POS tag of a
> > portion of a token such as "Peut-tu", so I can't write such a rule.
> >
> > I hope that's clearer now.
>
> Yes, I understand now. But isn't it simpler and more universal to make
> tokenization a little bit different? Even "voulez-vous" is not analyzed
> correctly, and it's pretty trivial to add special cases for the
> tokenizer to split words containing a bunch of personal tokens, as long
> as the first word is a verb. I do such intra-word tokenization for
> Polish for complex adjectives, and it's helpful for many uses, not just
> for the rule you mentioned.
>
> Regards,
> Marcin
>
Yes, I'll have a look at changing the tokenizer for French
to split things like "Peux-tu", as also suggested by Jaume.
I just glanced at the Catalan and Polish tokenizers, which do
that kind of thing. I don't want to split all words with hyphens,
as this would make it harder to write rules in most cases.
A word like "face-à-face" is just one noun and should
be treated as one token, but I'll make special cases
in the tokenizer to split things matching:
.*-{je,tu,ils?,elles?,on,nous,vous}
There may still be use cases for getting the POS tag of
a portion of a token, but hopefully tweaking the tokenization
will suffice for me for now.
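
As a rough illustration only (this is not the actual LanguageTool
tokenizer code; the class name HyphenSplitSketch, the method split() and
the choice to keep the hyphen attached to the pronoun are all assumptions
made for this sketch), the special case could look something like this:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not LanguageTool code: splits a token only when it
// ends with a hyphen followed by one of the listed subject pronouns.
class HyphenSplitSketch {

    // Regex form of the pattern .*-{je,tu,ils?,elles?,on,nous,vous}
    private static final Pattern VERB_PRONOUN = Pattern.compile(
        "(.+)(-(?:je|tu|ils?|elles?|on|nous|vous))",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Returns two parts (word, "-pronoun") on a match,
    // otherwise the token unchanged as a single-element list.
    static List<String> split(String token) {
        Matcher m = VERB_PRONOUN.matcher(token);
        if (m.matches()) {
            return List.of(m.group(1), m.group(2));
        }
        return List.of(token);
    }

    public static void main(String[] args) {
        System.out.println(split("Peux-tu"));      // [Peux, -tu]
        System.out.println(split("Peut-il"));      // [Peut, -il]
        System.out.println(split("face-à-face"));  // [face-à-face]
    }
}
```

This sketch does not check that the first part is actually a verb, so a
real implementation would probably need a tagger lookup before splitting.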
Regards
Dominique