Re: readable POS tags

Marcin Miłkowski Wed, 26 Mar 2014 09:50:39 -0700

On 2014-03-26 13:51, Daniel Naber wrote:
> On 2014-03-25 14:24, Marcin Miłkowski wrote:
>
>>> So instead of just adding the POS tag we get from Morfologik to our
>>> AnalyzedToken object as a string, we interpret it and store something
>>> like pos = preposition, case = accusative. Is it that what you mean?
>>
>> Exactly.
>
> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.


No, that would be horrible; it would not be an improvement. The problem 
is not that the tags are cryptic and short; it is that they do not make 
individual features available separately.

My use case for readable POS tags is also speed and simplicity in 
unification (rules that use agreement between words). It is simply 
faster to specify features as attributes that are processed once than 
to run a regexp every time a sentence is processed by a unification 
rule. For Catalan, Polish, and French this will be a huge time 
improvement.
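
To illustrate the point, here is a minimal sketch in Python (not 
LanguageTool's Java code). The colon-separated tag layout, the field 
names, and the helpers parse_tag and agree are assumptions made up for 
this example. Each tag is split into attributes once, and the agreement 
check then only compares attribute values, with no regexp involved.

```python
# Hypothetical tag layout pos:number:case:gender, assumed for illustration only.
FIELDS = ("pos", "number", "case", "gender")


def parse_tag(tag):
    """Split a tag string into named attributes (done once, at tagging time)."""
    return dict(zip(FIELDS, tag.split(":")))


def agree(a, b, features=("number", "case", "gender")):
    """Unification-style check: both readings carry the same value for each feature."""
    return all(a[f] == b[f] for f in features)


adj = parse_tag("adj:sg:nom:f")
noun = parse_tag("subst:sg:nom:f")
plural_noun = parse_tag("subst:pl:nom:f")

print(agree(adj, noun))         # the two readings agree
print(agree(adj, plural_noun))  # they differ in number
```

With string tags, the same check would need a regexp per feature on 
every rule evaluation; with parsed attributes it is a few dictionary 
lookups.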

Now, for this to work the attributes should be specified just like they 
are in Corpus Query Language (CQL).

>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=present},
>    {pos=verb, person=1|2|3, number=plural, tense=present}]


So for the word tagged as VBP we could have

<token pos="verb" person="1|2" number="sg" en:tense="present"/>

or

<token pos="verb" person="3" number="pl" en:tense="present"/>

Both would match a word with VBP. (Note that the disambiguator could 
even remove one of the interpretations to make it clear that this is a 
plural use of the token!)
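
A small Python sketch of how such matching could work. The expansion 
table VBP_READINGS and the helpers matches and token_matches are 
hypothetical names, and treating "|" in a query as "any of these 
values" is an assumption about the intended semantics, not existing 
LanguageTool behaviour.

```python
# Assumed expansion of the Penn tag VBP into two readings, following the
# proposal in this thread; attribute names and values are illustrative.
VBP_READINGS = [
    {"pos": "verb", "person": {"1", "2"}, "number": "sg", "en:tense": "present"},
    {"pos": "verb", "person": {"1", "2", "3"}, "number": "pl", "en:tense": "present"},
]


def matches(query, reading):
    """True if every attribute in the query is compatible with the reading.

    Query values may list alternatives separated by "|", as in person="1|2".
    """
    for name, value in query.items():
        wanted = set(value.split("|"))
        have = reading.get(name)
        if have is None:
            return False
        have = have if isinstance(have, set) else {have}
        if not (wanted & have):
            return False
    return True


def token_matches(query, readings):
    """A token matches if at least one of its readings matches the query."""
    return any(matches(query, r) for r in readings)


first = {"pos": "verb", "person": "1|2", "number": "sg", "en:tense": "present"}
second = {"pos": "verb", "person": "3", "number": "pl", "en:tense": "present"}
third_sg = {"pos": "verb", "person": "3", "number": "sg"}

print(token_matches(first, VBP_READINGS))
print(token_matches(second, VBP_READINGS))
print(token_matches(third_sg, VBP_READINGS))
```

Under these assumptions the first two queries match a VBP token, while 
a query for third person singular does not, which is what VBP is 
supposed to exclude.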

Above, I used a mixture of attributes without namespaces (these would be 
universal for all languages) and ones with namespaces, like tense, which 
is not present in all languages. We can look at the proposed Universal 
Tagset to find universal categories:

https://code.google.com/p/universal-pos-tags/

Note also that one could write:

<token pos="verb"/>

And this would be equivalent to:

<token postag="VB.*" postag_regexp="yes"/>

But possibly a lot faster. The new syntax is also much easier to 
read, and would be equivalent to the CQL query:

[pos="verb"]

Similarly for words in the comparative degree, where you now have to 
use (for English):

<token postag="..R" postag_regexp="yes"/>

You could simply say:

<token degree="com"/>
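
The claimed equivalences can be checked on a handful of Penn Treebank 
tags with a short Python sketch. The FEATURES table is a hand-written, 
partial mapping assumed for this example (it is not an existing 
LanguageTool resource), and the attribute values "pos", "com" and "sup" 
for degree are likewise illustrative.

```python
import re

# Partial, hand-written mapping from Penn tags to separate attributes (assumed).
FEATURES = {
    "VB": {"pos": "verb"}, "VBD": {"pos": "verb"}, "VBG": {"pos": "verb"},
    "VBN": {"pos": "verb"}, "VBP": {"pos": "verb"}, "VBZ": {"pos": "verb"},
    "NN": {"pos": "noun"}, "NNS": {"pos": "noun"},
    "JJ": {"pos": "adj", "degree": "pos"},
    "JJR": {"pos": "adj", "degree": "com"},
    "JJS": {"pos": "adj", "degree": "sup"},
    "RB": {"pos": "adv", "degree": "pos"},
    "RBR": {"pos": "adv", "degree": "com"},
    "RBS": {"pos": "adv", "degree": "sup"},
}

VERB_RE = re.compile(r"VB.*")      # postag="VB.*" postag_regexp="yes"
COMPARATIVE_RE = re.compile(r"..R")  # postag="..R" postag_regexp="yes"


def by_regex(pattern, tag):
    """Regexp-based test, evaluated against the whole tag string."""
    return pattern.fullmatch(tag) is not None


def by_attribute(tag, name, value):
    """Attribute-based test: a plain dictionary lookup."""
    return FEATURES[tag].get(name) == value


for tag in FEATURES:
    assert by_regex(VERB_RE, tag) == by_attribute(tag, "pos", "verb")
    assert by_regex(COMPARATIVE_RE, tag) == by_attribute(tag, "degree", "com")

print(sorted(t for t in FEATURES if by_attribute(t, "degree", "com")))
```

On this sample the regexp tests and the attribute tests select exactly 
the same tags; the attribute form just does not require knowing how the 
tag strings are spelled.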

Basically, by making attributes separate we could write complex rules 
much more easily, without having to work out how to spell the POS tags. 
I consider myself a power user, but with the complex Polish tagset it 
is sometimes really difficult to specify the features I want using 
regexes: the tagset itself creates pretty complex and lengthy strings, 
and a lot of time is needed to make sure that the regex matches.

Regards,
Marcin

>
> Regards
>    Daniel
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>

