Of course you can use regex patterns, but it gets pretty complicated. See:
https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf
Christopher Manning uses the example of a word that ends in “c” as a feature
for the class drug. That could be a regex feature. You could also have a regex
pattern, but you need to be very specific, e.g. word (\w+name\+:?). Keep in mind
that if you are creating a feature generator, it works on one word (actually a
token); you would have to play games to transform the (String[]) tokens back
into a string.
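To make that concrete, here is a tiny standalone sketch of a per-token regex feature. The class name, method shape, and feature string are mine, and the pattern is just Manning's "ends in c" example; in OpenNLP you would put logic like this inside your own AdaptiveFeatureGenerator implementation rather than a static method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexTokenFeature {
    // Illustrative pattern only: tokens that end in "c" (Manning's drug-class example).
    private static final Pattern ENDS_IN_C = Pattern.compile("\\w+c");

    // Emits a feature string when the single token matches the regex.
    // Note the regex only ever sees one token, never the whole sentence.
    public static List<String> createFeatures(String token) {
        List<String> features = new ArrayList<>();
        if (ENDS_IN_C.matcher(token).matches()) {
            features.add("endsInC");
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(createFeatures("generic")); // matches: [endsInC]
        System.out.println(createFeatures("name"));    // no match: []
    }
}
```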

Looking at your data, I would consider setting a feature “ends with :” and
“ends with ,”; it appears that the previous word often ends with : and the
current word often ends with a comma. Of course, check that your tokenizer
does not separate the punctuation from the word. You’ll have to see if it
works or not.
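As a rough sketch of what those two features could look like over the token array (the names and method shape are mine, not OpenNLP's, and it assumes the tokenizer left the punctuation attached):

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSuffixFeatures {
    // Emits a feature when the previous token ends with ":" and another
    // when the current token ends with ",".
    public static List<String> createFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        if (index > 0 && tokens[index - 1].endsWith(":")) {
            features.add("prevEndsWithColon");
        }
        if (tokens[index].endsWith(",")) {
            features.add("curEndsWithComma");
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"name:", "John,", "from", "Rome"};
        System.out.println(createFeatures(tokens, 1));
    }
}
```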

Hope it helps.
Daniel


On May 2, 2016, at 9:31 AM, Damiano Porta
<[email protected]> wrote:

Hi Daniel! Thank you so much!

Unfortunately, I am not sure. I really do not know what the best way is in
this case.
I have a dataset with patterns like:

my name is {name}, from {location}
name: {name}
full name: {name}
I am {name}, i was born in {location}

etc etc etc

I could use regexes too. Maybe a list of patterns that I can loop over for each
document. What do you think? I do not know if I can build a training set
with those examples (I have around 100 different patterns).
How can I create those features with my patterns?

Thank you in advance!


2016-05-02 15:19 GMT+02:00 Russ, Daniel (NIH/CIT) [E]
<[email protected]>:

Hi Damiano,

    Why are you so sure that your model will not work?  A couple of
things to remember: 1. you need quite a bit of training data; two
sentences do not make a training set.  2. You probably need more than a
window of words as your features.  However, you can see that word-2=“name”
and word-1=“is” tend to precede a name.  Look into other potential features
and get a larger dataset, and your results may surprise you.

Daniel


On May 1, 2016, at 3:13 PM, Jeffrey Zemerick
<[email protected]> wrote:

I'm sure the others on this list can give you a more complete answer so I
will try to not lead you astray.

The WindowFeatureGenerator is only one of the available feature generators.
There are many classes that implement the AdaptiveFeatureGenerator
interface [1] and you can, of course, provide your own implementation of
that interface to support additional features. For example, the
SentenceFeatureGenerator [2] looks at the beginning and end of each
training sentence. So to answer your question, the length of the training
sentence should not matter - what matters is whether the combination of
configured feature generators can produce a model that accurately
describes the training text.
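For illustration, combining generators might look like the sketch below. The local interface is a cut-down stand-in for OpenNLP's AdaptiveFeatureGenerator (the real one also receives the previous outcomes), and the pooling loop mirrors the idea behind OpenNLP's AggregatedFeatureGenerator:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for opennlp.tools.util.featuregen.AdaptiveFeatureGenerator.
interface FeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index);
}

public class CombinedFeatures {
    // Run several generators and pool their feature strings for one position,
    // the way AggregatedFeatureGenerator does in OpenNLP.
    public static List<String> generate(String[] tokens, int index, FeatureGenerator... gens) {
        List<String> features = new ArrayList<>();
        for (FeatureGenerator g : gens) {
            g.createFeatures(features, tokens, index);
        }
        return features;
    }

    public static void main(String[] args) {
        FeatureGenerator lowercase = (f, t, i) -> f.add("w=" + t[i].toLowerCase());
        FeatureGenerator length = (f, t, i) -> f.add("len=" + t[i].length());
        String[] tokens = {"My", "name", "is", "Barack"};
        System.out.println(generate(tokens, 3, lowercase, length)); // [w=barack, len=6]
    }
}
```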

Jeff

[1]

https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
[2]

https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/SentenceFeatureGenerator.html


On Sun, May 1, 2016 at 12:02 PM, Damiano Porta <[email protected]>
wrote:

Hi Jeff!
Thank you so much for your fast reply.

I have a doubt. Let's suppose we use this feature with a window of:

2 tokens on the left + *ENTITY* + 2 tokens on the right

My doubt is: how can I train the model correctly?

If only the previous 2 tokens and the next 2 tokens matter, I should not
use long sentences to train the model. Right?

For example (person-model.train):

1. I am <START:person> Barack <END> and I am the president of USA

2. My name is <START:person> Barack <END> and my surname is Obama

...

Those are two stupid training samples, just to show you my doubt.

In this case i should have:

*I am Barack and I*

*name is Barack and my*

the other tokens (left and right) do not matter. So the sentences in my
training set should be very short, right? Basically, I should only define
all the "combinations" of the previous/next 2 tokens, right?

Thank you!
Damiano



2016-05-01 16:07 GMT+02:00 Jeffrey Zemerick <[email protected]>:

I think you are looking for the WindowFeatureGenerator [1]. You can set the
size of the window by specifying the number of previous tokens and the
number of next tokens.

Jeff

[1]



https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
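If it helps to see the idea without pulling in OpenNLP, here is a standalone sketch of what a window of 2 previous and 2 next tokens yields for one position. The class name and the "prev"/"next" prefixes are illustrative, not OpenNLP's exact feature strings:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFeaturesSketch {
    // Emit the tokens in a window of `prev` positions before and `next`
    // positions after the current index, each tagged with its offset.
    public static List<String> createFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        for (int i = Math.max(0, index - prev); i < index; i++) {
            features.add("prev" + (index - i) + "=" + tokens[i]);
        }
        features.add("w=" + tokens[index]);
        for (int i = index + 1; i <= Math.min(tokens.length - 1, index + next); i++) {
            features.add("next" + (i - index) + "=" + tokens[i]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"My", "name", "is", "Barack", "and", "my"};
        // Prints [prev2=name, prev1=is, w=Barack, next1=and, next2=my] --
        // exactly the word-2="name", word-1="is" cues Daniel mentioned.
        System.out.println(createFeatures(tokens, 3, 2, 2));
    }
}
```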


On Sun, May 1, 2016 at 5:16 AM, Damiano Porta <[email protected]>
wrote:

Hello everybody,
How many surrounding tokens are taken into account to find the entity
using a maxent model?
Basically, a maxent model should detect an entity by looking at the
surrounding tokens, right?
I would like to understand:

1. Can I set the number of tokens on the left side?
2. Can I set the number of tokens on the right side too?

Thank you in advance for the clarification.
Best

Damiano




