Sorry, looks like the link I posted here does not match up with the Penn
Treebank Tag Set we are using. Some of the tags are not even included in
our training data.
I went on and tried to find a better description and looked at this page:
http://www.cis.upenn.edu/~treebank/home.html
The tagset described here seems to match with the tags we are using:
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz
But some tags seem to be missing from the list, especially tags for
punctuation, brackets, etc.
Does someone know where a complete list with descriptions of the Penn
Treebank tags can be found?
Jörn
On 10/11/11 1:52 PM, Fotiadis, Konstantinos wrote:
After looking again, I think I probably didn't have them matched up perfectly.
I just did a sort in Excel, and realized that maybe this would make more sense?
(Sorry, been up for 51 hours straight!)
Penn Treebank Tag Set Definition      Produced by OpenNLP API
Tag     Definition                    Tag
NNS     Noun, plural                  NNS
NP      Proper noun, singular         NNP
NPS     Proper noun, plural           NNPS
PP      Personal pronoun              PRP
PP$     Possessive pronoun            PRP$
Is that right?
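The correspondence in the table above can be written down as a small lookup.
This is only a sketch: the class name LegacyTagNames and the method
toModelTag are made up for illustration and are not part of OpenNLP, and
the four renamed pairs are exactly the ones listed in the table. Any tag not
in the map is assumed to be spelled the same way in both lists.

```java
import java.util.Map;

// Hypothetical helper, not part of the OpenNLP API.
final class LegacyTagNames {

    // Tag as written in the linked list -> tag produced by the OpenNLP model,
    // copied from the table in this message.
    private static final Map<String, String> RENAMED = Map.of(
            "NP", "NNP",
            "NPS", "NNPS",
            "PP", "PRP",
            "PP$", "PRP$");

    // Returns the model's spelling of a tag; tags without an entry
    // are assumed to be identical in both lists (e.g. NNS).
    static String toModelTag(String listedTag) {
        return RENAMED.getOrDefault(listedTag, listedTag);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NNS", "NP", "NPS", "PP", "PP$"}) {
            System.out.println(tag + " -> " + toModelTag(tag));
        }
    }
}
```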
-----Original Message-----
From: Jörn Kottmann [mailto:kottm...@gmail.com]
Sent: Tuesday, October 11, 2011 6:13 AM
To: opennlp-users@incubator.apache.org
Subject: EXTERNAL: Re: POS Tags
The English POS Model from the SourceForge download page
uses the Penn Treebank Tag Set.
Here is a link which lists all tags:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html
Jörn
On 10/11/11 6:56 AM, Fotiadis, Konstantinos wrote:
I have been looking around and have not found the definitions for the
POS tags.
Can you help me with these?
Example:
"This is not a long sentence. I like turtles. Happiness is great!"
I then call SentenceDetectorME to detect sentences, loop through the
sentences, and call Tokenizer on each one. I then pass the token String array to
POSTaggerME to get the POS. Here is my output:
Number of Sentences=3
SENTENCE_ID=1 - TOKENS=7 - This is not a long sentence.
TOKEN_ID=1 - POS=DT - This
TOKEN_ID=2 - POS=VBZ - is
TOKEN_ID=3 - POS=RB - not
TOKEN_ID=4 - POS=DT - a
TOKEN_ID=5 - POS=JJ - long
TOKEN_ID=6 - POS=NN - sentence
TOKEN_ID=7 - POS=. - .
SENTENCE_ID=2 - TOKENS=4 - I like turtles.
TOKEN_ID=1 - POS=PRP - I
TOKEN_ID=2 - POS=IN - like
TOKEN_ID=3 - POS=NNS - turtles
TOKEN_ID=4 - POS=. - .
SENTENCE_ID=3 - TOKENS=4 - Happiness is great!
TOKEN_ID=1 - POS=NNP - Happiness
TOKEN_ID=2 - POS=VBZ - is
TOKEN_ID=3 - POS=JJ - great
TOKEN_ID=4 - POS=. - !
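For the tags that appear in the output above, the usual Penn Treebank
meanings can be kept in a small glossary. The sketch below is plain Java with
no OpenNLP dependency; the class name PennTagGlossary and the method describe
are invented for illustration, and the map covers only the ten tags seen in
this output, not the full tag set. The token and tag arrays in main are
copied by hand from the first sentence of the output rather than produced by
POSTaggerME.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the OpenNLP API.
final class PennTagGlossary {

    // Partial glossary: only the tags that occur in the sample output.
    private static final Map<String, String> DEFINITIONS = new LinkedHashMap<>();
    static {
        DEFINITIONS.put("DT", "Determiner");
        DEFINITIONS.put("VBZ", "Verb, 3rd person singular present");
        DEFINITIONS.put("RB", "Adverb");
        DEFINITIONS.put("JJ", "Adjective");
        DEFINITIONS.put("NN", "Noun, singular or mass");
        DEFINITIONS.put("NNS", "Noun, plural");
        DEFINITIONS.put("NNP", "Proper noun, singular");
        DEFINITIONS.put("PRP", "Personal pronoun");
        DEFINITIONS.put("IN", "Preposition or subordinating conjunction");
        DEFINITIONS.put(".", "Sentence-final punctuation");
    }

    // Looks up a tag; anything outside the partial glossary gets a placeholder.
    static String describe(String tag) {
        return DEFINITIONS.getOrDefault(tag, "(not in this partial glossary)");
    }

    public static void main(String[] args) {
        // Tokens and tags copied from SENTENCE_ID=1 in the output above.
        String[] tokens = {"This", "is", "not", "a", "long", "sentence", "."};
        String[] tags = {"DT", "VBZ", "RB", "DT", "JJ", "NN", "."};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " - " + tags[i] + " - " + describe(tags[i]));
        }
    }
}
```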
Just curious about the definitions...
Thanks, Kosta