Sorry, looks like the link I posted here does not match up with the Penn Treebank Tag Set we
are using. Some of the tags are not even included in our training data.

I went on and tried to find a better description and looked at this page:
http://www.cis.upenn.edu/~treebank/home.html

The tagset described here seems to match with the tags we are using:
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

But some tags seem to be missing from that list, especially the tags for punctuation, brackets, etc.
Does anyone know where a complete list of the Penn Treebank tags, with descriptions, can be found?

Jörn

On 10/11/11 1:52 PM, Fotiadis, Konstantinos wrote:
After looking again, I think I probably didn't have them matched up perfectly. 
I just did a sort in Excel, and realized that maybe this would make more sense? 
(Sorry, been up for 51 hours straight!)


Penn Treebank Tag Set Definition     Produced by OpenNLP API

Tag     Definition                   Tag
NNS     Noun, plural                 NNS
NP      Proper noun, singular        NNP
NPS     Proper noun, plural          NNPS
PP      Personal pronoun             PRP
PP$     Possessive pronoun           PRP$




Is that right?
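
The pairing in the table above could be written down as a small lookup, roughly like this. This is only a sketch: LegacyTagMapper and normalize are made-up names, not part of OpenNLP, and the mapping is just the five rows from the table, assuming they are matched up correctly.

```java
import java.util.Map;

// Hypothetical helper (not an OpenNLP class): maps the tag names used in the
// linked list to the tag names produced by the OpenNLP English POS model,
// using only the rows from the table above.
class LegacyTagMapper {
    private static final Map<String, String> LISTED_TO_PRODUCED = Map.of(
        "NNS", "NNS",   // Noun, plural (same in both)
        "NP", "NNP",    // Proper noun, singular
        "NPS", "NNPS",  // Proper noun, plural
        "PP", "PRP",    // Personal pronoun
        "PP$", "PRP$"   // Possessive pronoun
    );

    // Returns the tag name the model produces; tags not in the table are
    // returned unchanged.
    static String normalize(String tag) {
        return LISTED_TO_PRODUCED.getOrDefault(tag, tag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NP", "NPS", "PP", "PP$", "NNS", "DT"}) {
            System.out.println(tag + " -> " + normalize(tag));
        }
    }
}
```

Tags that are not in the table (DT, for example) pass through unchanged, so the lookup is safe to apply to every tag.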



-----Original Message-----
From: Jörn Kottmann [mailto:kottm...@gmail.com]
Sent: Tuesday, October 11, 2011 6:13 AM
To: opennlp-users@incubator.apache.org
Subject: EXTERNAL: Re: POS Tags



The English POS Model from the SourceForge download page uses the Penn Treebank Tag Set.

Here is a link which lists all tags:

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html



Jörn



On 10/11/11 6:56 AM, Fotiadis, Konstantinos wrote:

I have been looking around and have not found the definitions for the 
POS tags.
Can you help me with these?
Example:
"This is not a long sentence. I like turtles. Happiness is great!"
I then call SentenceDetectorME to detect sentences. Then loop through the 
sentences and call Tokenizer on each one. I then pass the token String array to 
POSTaggerME to get the POS. Here is my output:
Number of Sentences=3
SENTENCE_ID=1 - TOKENS=7 - This is not a long sentence.
    TOKEN_ID=1 - POS=DT - This
    TOKEN_ID=2 - POS=VBZ - is
    TOKEN_ID=3 - POS=RB - not
    TOKEN_ID=4 - POS=DT - a
    TOKEN_ID=5 - POS=JJ - long
    TOKEN_ID=6 - POS=NN - sentence
    TOKEN_ID=7 - POS=. - .
SENTENCE_ID=2 - TOKENS=4 - I like turtles.
    TOKEN_ID=1 - POS=PRP - I
    TOKEN_ID=2 - POS=IN - like
    TOKEN_ID=3 - POS=NNS - turtles
    TOKEN_ID=4 - POS=. - .
SENTENCE_ID=3 - TOKENS=4 - Happiness is great!
    TOKEN_ID=1 - POS=NNP - Happiness
    TOKEN_ID=2 - POS=VBZ - is
    TOKEN_ID=3 - POS=JJ - great
    TOKEN_ID=4 - POS=. - !
Just curious about the definitions...
Thanks, Kosta
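
For the tags that show up in the output above, a small glossary might look something like the following. It is a sketch only: PosTagGlossary and describe are hypothetical names, not part of the OpenNLP API, and the table covers just the ten tags seen in this example (definitions as given in the Penn Treebank tag set), not the full set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an OpenNLP class): definitions for the Penn
// Treebank tags that appear in the example output. Not a complete tag set.
class PosTagGlossary {
    private static final Map<String, String> DEFINITIONS = new LinkedHashMap<>();
    static {
        DEFINITIONS.put("DT", "Determiner");
        DEFINITIONS.put("IN", "Preposition or subordinating conjunction");
        DEFINITIONS.put("JJ", "Adjective");
        DEFINITIONS.put("NN", "Noun, singular or mass");
        DEFINITIONS.put("NNS", "Noun, plural");
        DEFINITIONS.put("NNP", "Proper noun, singular");
        DEFINITIONS.put("PRP", "Personal pronoun");
        DEFINITIONS.put("RB", "Adverb");
        DEFINITIONS.put("VBZ", "Verb, 3rd person singular present");
        DEFINITIONS.put(".", "Sentence-final punctuation");
    }

    // Looks up a tag; unknown tags get a placeholder instead of null.
    static String describe(String tag) {
        return DEFINITIONS.getOrDefault(tag, "(no definition in this table)");
    }

    public static void main(String[] args) {
        // Tokens and tags copied from SENTENCE_ID=2 in the output above.
        String[] tokens = {"I", "like", "turtles", "."};
        String[] tags = {"PRP", "IN", "NNS", "."};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " / " + tags[i] + " / " + describe(tags[i]));
        }
    }
}
```

The String array returned by the tagger could be passed through describe in the same loop that prints TOKEN_ID and POS.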


