Daniel Naber <daniel.na...@languagetool.org> wrote:
> On 2014-04-27 22:18, Dominique Pellé wrote:
>
>> I wish I could check the POS tag of a portion of
>> a token.
>
> (Replying to an old thread here...)
>
> I think the new rule filter offers a solution for this that does not
> require any changes to the XML. Mike posted an example where he wanted
> to match in(.*) and un(.*), but only if the matching part is an
> adjective.
>
> It might work with a PartialPosTagFilter (that is yet to be developed)
> like this:
>
> <pattern>
> <token regexp="yes">(in|un).*</token>
> </pattern>
> <filter class="org.lt.PartialPosTagFilter" args="no:1
> regexp:(?:in|un)(.*) postag_regexp:JJ message:'Approved adjective, but
> not-approved prefix. Use NOT + <adjective>.'"/>
>
> no:1 refers to the relevant token
> regexp:(?:in|un)(.*) specifies the part of the token to be checked
> postag_regexp:JJ checks whether the matching part of the regex matches
> 'JJ' (adjective)
> message: the message to be shown if postag_regexp matches; if it
> doesn't, the rule match will be discarded
>
> So as with any rule filter, we have a rule that matches candidates, and
> a filter that keeps only those matches that are actually useful.
>
> What do you think?
Hi Daniel,

Thanks for replying to this old thread. I would still like
to have this feature somehow, for French and Breton
at least. But I did not fully understand how it would work
with the <filter class=...> proposal:
* Why does your example give a message in
the Java rule? Why can't we use <message…></message>
instead?
* You wrote that args="no:1" refers to the token.
What if we need to use this for one of the
<exception>...</exception> elements inside a token?

In fact, in one of my earlier messages in this thread,
I had an example where I wanted to match the POS tag
of a portion of a token as well as of a portion of the
exception:
<pattern>
<token regexp="yes" postag_group1="V.*" postag_group1_regexp="yes">(.*)-tu
<exception regexp="yes" postag_group1="V.* 2 .*"
postag_group1_regexp="yes">(.*)-tu</exception>
</token>
</pattern>
In other words, the rule matches the token "(.*)-tu" where
the POS of the portion in parentheses has to be a verb (V.*).
But there is an exception if the POS of the portion in parentheses
matches "V.* 2 .*". So that rule would correctly:
* signal "Peut-tu" as an error (since "Peut" has postag
"V ind pres 3 s", which matches the token but not the
exception).
* ignore "Peux-tu" since it's correct ("Peux" has postag
"V ind pres 2 s", which matches both the token and the exception).
Of course the syntax was rejected at the time (which I
can understand, as introducing postag_group1 is not nice),
but it illustrates the fact that we'd need to allow such a
mechanism not only for tokens, but also for exceptions.
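
To make the intended behaviour concrete, here is a rough Python sketch of the "Peut-tu" / "Peux-tu" rule described above. It is only an illustration of the semantics I have in mind, not LanguageTool code: the toy lexicon and the names tags_for, partial_pos_matches and rule_matches are all made up for this example, and a real implementation would ask the tagger for the POS tags.

```python
import re

# Toy lexicon (an assumption for illustration only): word -> POS tags.
# A real implementation would query the language's tagger instead.
TOY_LEXICON = {
    "peut": ["V ind pres 3 s"],
    "peux": ["V ind pres 2 s"],
}


def tags_for(word):
    """Return the POS tags of a word, or an empty list if it is unknown."""
    return TOY_LEXICON.get(word.lower(), [])


def partial_pos_matches(token, token_regex, postag_regex):
    """True if group 1 of token_regex has at least one tag matching postag_regex."""
    m = re.fullmatch(token_regex, token)
    if m is None:
        return False
    return any(re.fullmatch(postag_regex, tag) for tag in tags_for(m.group(1)))


def rule_matches(token):
    """The token part must be a verb; the exception excludes 2nd person verbs."""
    is_verb = partial_pos_matches(token, r"(.*)-tu", r"V.*")
    is_exception = partial_pos_matches(token, r"(.*)-tu", r"V.* 2 .*")
    return is_verb and not is_exception


if __name__ == "__main__":
    for candidate in ["Peut-tu", "Peux-tu"]:
        print(candidate, rule_matches(candidate))
```

The point is that the same "tag only the captured part" check is needed twice, once for the token and once for its exception.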

Maybe Marcin's proposal is better. But I did not fully
understand it either. Marcin wrote:
<token postag="V.*" postag_regexp='yes'><match no="3"
regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>[whatever you
want here]</token>
What is [whatever you want here] meant to be?
And why no="3"?
I think I need a concrete example to make sure I understand.
If I'm guessing correctly, the French rule that I proposed above
would be written using Marcin's proposal as:
<token postag="V.*" postag_regexp="yes">
<match no="1" regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>
<exception postag="V.* 2 .*" postag_regexp="yes">
<match no="1" regexp_match="(.*)-tu" regexp_replace="$1" setpos="yes"/>
</exception>
</token>
It looks a tad hairy, and I'm not sure I understood the proposals correctly.
Regards
Dominique