buggy tagging in Slovak?

Milos Sramek Tue, 25 Jun 2013 01:13:56 -0700

Hi,

I have created my first rule. It more or less works, but its behaviour is weird. The rule should find a phrase consisting of 'koho' + an inflected form of byť + some arbitrary words + a noun (inflected). The weird thing is that segments of different lengths are marked, even though the sentences are grammatically equivalent (the marked segment is in braces):


Nevedel, {koho je to nové auto}. (He did not know whose the new car is)
Nevedel, {koho je to modré} auto.

Nové (new) and modré (blue) are both adjectives in the same form (they should be skipped).


Nevedel, {koho je to nové a} rýchle auto.  (new and fast)
Nevedel, {koho je to nové i} rýchle auto.
Nevedel, {koho je to nové aj rýchle auto}.

Here, 'a', 'i' and 'aj' are conjunctions, all meaning 'and' (they should be skipped).

To understand the problem, I tried the command line version of LT with the -t switch. According to its output, it seems to me that the tagging is incorrect.


For example, for the 'aj' conjunction I get
aj[aj/J,aj/O,aj/T]
which is correct. However, for the 'a' conjunction I get

a[a/J,a/O,a/Q,a/SUnp1,a/SUnp2,a/SUnp3,a/SUnp4,a/SUnp5,a/SUnp6,a/SUnp7,a/SUns1,a/SUns2,a/SUns3,a/SUns4,a/SUns5,a/SUns6,a/SUns7,a/T,a/W,as/W]

So 'a' is tagged also as a noun (the S tag), which explains why marking in
Nevedel, {koho je to nové a} rýchle auto
stops after 'a'.
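
To make this explanation concrete, here is a small Python sketch of how I understand the matching. It is only a simplified model, not LT's actual code: I assume that skip="-1" followed by the postag token ends the match at the first later token that has any reading fully matching the regexp S..... The function names are made up for the illustration, the readings of 'a' and 'aj' are shortened from the -t output above, and the tags of the other words are assumptions.

```python
import re

# Same regexp as postag="S...." in the rule (assumed to require a full match).
NOUN = re.compile(r"S....")


def first_noun_index(tokens, start=0):
    """Return the index of the first token at or after `start` that has
    at least one reading fully matching NOUN, or None if there is none."""
    for i in range(start, len(tokens)):
        _word, tags = tokens[i]
        if any(NOUN.fullmatch(tag) for tag in tags):
            return i
    return None


# Shortened from the -t output quoted in this mail.
A_TAGS = ["J", "O", "Q", "SUnp1", "SUns1", "T", "W"]
AJ_TAGS = ["J", "O", "T"]


def tail(conj, conj_tags):
    """Tokens following 'koho je'; all tags except the conjunction's are assumed."""
    return [("to", ["PFns1"]), ("nové", ["AAns1x"]), (conj, conj_tags),
            ("rýchle", ["AAns1x"]), ("auto", ["SSns1"])]


for conj, tags in (("a", A_TAGS), ("aj", AJ_TAGS)):
    tokens = tail(conj, tags)
    end = first_noun_index(tokens)
    print(" ".join(word for word, _ in tokens[: end + 1]))
# prints "to nové a"
# prints "to nové aj rýchle auto"
```

With the noun readings of 'a' present, the model stops exactly where LT stops; without them (as for 'aj') it runs on to 'auto'.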

Something similar happens with adjectives (the 'A' tag), which are also recognized as nouns (the 'S' tag):

modré[modré/SAns1,modré/SAns4,modré/SAns5,modrá/SAfp1,modrá/SAfp4,modrá/SAfp5,modrý/AAfp1x,modrý/AAfp4x,modrý/AAfp5x,modrý/AAip1x,modrý/AAip4x,modrý/AAip5x,modrý/AAnp1x,modrý/AAnp4x,modrý/AAnp5x,modrý/AAns1x,modrý/AAns4x,modrý/AAns5x]
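
A quick way to spot such ambiguous words is to reduce each line of the -t output to the set of word classes (the first letter of each tag). The following Python sketch does that. It assumes the output format is exactly word[lemma/tag,lemma/tag,...] as quoted in this mail, which may not hold in general; the function names are hypothetical and the 'modré' line is shortened.

```python
def parse_readings(line):
    """Split a line like 'aj[aj/J,aj/O,aj/T]' into the word and a list of
    (lemma, tag) pairs. Assumes the simple format quoted in this mail."""
    word, _, rest = line.strip().partition("[")
    readings = []
    for item in rest.rstrip("]").split(","):
        lemma, _, tag = item.partition("/")
        readings.append((lemma, tag))
    return word, readings


def word_classes(line):
    """Return the sorted set of first tag letters (word classes) of a line."""
    _, readings = parse_readings(line)
    return sorted({tag[0] for _, tag in readings})


print(word_classes("aj[aj/J,aj/O,aj/T]"))               # ['J', 'O', 'T']
print(word_classes("modré[modré/SAns1,modrý/AAns1x]"))  # ['A', 'S']
```

Any word whose class set contains 'S' next to another class can end the skipped part of my pattern too early.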

This is perhaps a problem outside LT. If so, I should probably talk to the author of the tagging tool. Do you have a contact?


Thanks
Milos

The rule:
            <rule>
                <pattern>
                    <token>koho</token>
                    <token skip="-1" inflected="yes">byť</token>
                    <!-- find a noun here -->
                    <token postag="S...." postag_regexp="yes"></token>
                </pattern>
                <message>
                    Zámeno 'koho' nahraďte zámenom 'čí':
                    <suggestion><match no="3" postag="S.(.)(.)(.)" postag_replace="PA$1$2$3">čí</match> <match no="2" include_skipped="all"/> <match no="3"/></suggestion>.
                </message>
                <short>Možná gramatická chyba. Zámeno 'koho' možno treba nahradiť zámenom 'čí' (pravidlo 3)</short>
                <example correction="čí bol ten vysoký dom" type="incorrect">
                  Nevedel, <marker>koho bol ten vysoký dom</marker>.
                </example>
                <example type="correct">
                  Nevedel, čí bol ten vysoký dom.
                </example>
            </rule>

-- email & jabber: [email protected]

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
