Hi,
I have created my first rule. It mostly works, but its behaviour is
odd. The rule should find a phrase consisting of
'koho' + inflected byť + some arbitrary words + noun (inflected). The
odd thing is that segments of different lengths are marked, even though
the sentences are grammatically equivalent (the marked segment is in
braces):
Nevedel, {koho je to nové auto}. (He did not know whose the new car is.)
Nevedel, {koho je to modré} auto.
Nové (new) and modré (blue) are both adjectives in the same form, so
both should be skipped.
Nevedel, {koho je to nové a} rýchle auto. (new and fast)
Nevedel, {koho je to nové i} rýchle auto.
Nevedel, {koho je to nové aj rýchle auto}.
Here, 'a', 'i' and 'aj' are conjunctions, all meaning 'and', so all of
them should be skipped.
To understand the problem, I tried the command-line version of LT with
the -t switch. Judging by its output, the tagging seems to be
incorrect.
For example, for the 'aj' conjunction I get
aj[aj/J,aj/O,aj/T]
which is correct. However, for the 'a' conjunction I get
a[a/J,a/O,a/Q,a/SUnp1,a/SUnp2,a/SUnp3,a/SUnp4,a/SUnp5,a/SUnp6,a/SUnp7,a/SUns1,a/SUns2,a/SUns3,a/SUns4,a/SUns5,a/SUns6,a/SUns7,a/T,a/W,as/W]
So 'a' is also tagged as a noun (the S tag), which explains why the marking in
Nevedel, {koho je to nové a} rýchle auto
stops after 'a'.
Something similar happens with adjectives (the 'A' tag), which are also
recognized as nouns (the 'S' tag):
modré[modré/SAns1,modré/SAns4,modré/SAns5,modrá/SAfp1,modrá/SAfp4,modrá/SAfp5,modrý/AAfp1x,modrý/AAfp4x,modrý/AAfp5x,modrý/AAip1x,modrý/AAip4x,modrý/AAip5x,modrý/AAnp1x,modrý/AAnp4x,modrý/AAnp5x,modrý/AAns1x,modrý/AAns4x,modrý/AAns5x]
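To double-check which readings can satisfy the rule's last token, here is a
small Python sketch. It takes the three -t outputs quoted above and tests each
tag against the same regexp as postag="S....". It assumes that LT requires the
whole tag to match the regexp (hence fullmatch); the helper names are made up
for this illustration and are not part of LT.

```python
import re

# Tagger output copied verbatim from the -t run quoted in this mail.
TAGGED = {
    "aj": "aj[aj/J,aj/O,aj/T]",
    "a": "a[a/J,a/O,a/Q,a/SUnp1,a/SUnp2,a/SUnp3,a/SUnp4,a/SUnp5,a/SUnp6,"
         "a/SUnp7,a/SUns1,a/SUns2,a/SUns3,a/SUns4,a/SUns5,a/SUns6,a/SUns7,"
         "a/T,a/W,as/W]",
    "modré": "modré[modré/SAns1,modré/SAns4,modré/SAns5,modrá/SAfp1,"
             "modrá/SAfp4,modrá/SAfp5,modrý/AAfp1x,modrý/AAfp4x,modrý/AAfp5x,"
             "modrý/AAip1x,modrý/AAip4x,modrý/AAip5x,modrý/AAnp1x,"
             "modrý/AAnp4x,modrý/AAnp5x,modrý/AAns1x,modrý/AAns4x,"
             "modrý/AAns5x]",
}

# Same regexp as postag="S...." in the rule.
NOUN_TAG = re.compile(r"S....")


def readings(line):
    """Split 'token[lemma/TAG,...]' into a list of (lemma, tag) pairs."""
    inside = line[line.index("[") + 1 : line.rindex("]")]
    return [tuple(item.split("/", 1)) for item in inside.split(",")]


def noun_readings(line):
    """Return the readings whose tag matches the noun regexp as a whole.

    Assumption: the whole tag has to match, so fullmatch is used.
    """
    return [(lemma, tag) for lemma, tag in readings(line)
            if NOUN_TAG.fullmatch(tag)]


for word, line in TAGGED.items():
    tags = [tag for _, tag in noun_readings(line)]
    print(f"{word}: {len(tags)} reading(s) match S....: {tags}")
```

If the assumption holds, 'aj' has no reading that matches, while 'a' and
'modré' each have several, which fits the places where the marking stops.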
This is perhaps a problem outside LT. If so, I should probably talk
to the author of the tagging tool. Do you have a contact?
Thanks
Milos
The rule:
<rule>
<pattern>
<token>koho</token>
<token skip="-1" inflected="yes">byť</token>
<token postag="S...." postag_regexp="yes"></token>
<!-- find a noun here -->
</pattern>
<message>
Zámeno 'koho' nahraďte zámenom 'čí':
<suggestion><match no="3" postag="S.(.)(.)(.)"
postag_replace="PA$1$2$3">čí</match> <match no="2"
include_skipped="all"/> <match no="3"/></suggestion>.
</message>
<short>Možná gramatická chyba. Zámeno 'koho' možno
treba nahradiť zámenom 'čí' (pravidlo 3)</short>
<example correction="čí bol ten vysoký dom"
type="incorrect">
Nevedel, <marker>koho bol ten vysoký dom</marker>.
</example>
<example type="correct">
Nevedel, čí bol ten vysoký dom.
</example>
</rule>
-- email & jabber: [email protected]