Awesome! Thank you so much, William!

2016-06-29 13:36 GMT+02:00 William Colen <william.co...@gmail.com>:
> To create a NER model, OpenNLP extracts features from the context, such
> as: word prefix and suffix, next word, previous word, previous word
> prefix and suffix, next word prefix and suffix, etc.
> When you don't configure the feature generator, it applies the default:
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
>
> Default feature generator:
>
> AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
>     new AdaptiveFeatureGenerator[]{
>         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
>         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
>         new OutcomePriorFeatureGenerator(),
>         new PreviousMapFeatureGenerator(),
>         new BigramNameFeatureGenerator(),
>         new SentenceFeatureGenerator(true, false)
>     });
>
> These default features should work for most cases (especially English), but
> they can of course be incremented. If you do so, your model will take the new
> features into account. So yes, you are putting the features in your model.
>
> Configuring custom features is not easy. I would start with the default,
> use 10-fold cross-validation, and take notes of its effectiveness. Then
> change or add a feature, evaluate, and take notes. Sometimes a feature that
> we are sure would help can destroy the model's effectiveness.
>
> Regards
> William
>
>
> 2016-06-29 7:00 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
>
> > Thank you William! Really appreciated!
> >
> > There is only one point I do not get. When you said "You could increment
> > your model using Custom Feature Generators", does it mean that I can "put"
> > these features inside ONE .bin file (model) that implements different
> > things, or is the name finder one thing and those feature generators
> > another?
> >
> > Thank you in advance for the clarification.
> >
> > 2016-06-29 1:23 GMT+02:00 William Colen <william.co...@gmail.com>:
> >
> > > Not exactly.
> > > You would create a new NER model to replace yours.
> > >
> > > In this approach you would need a corpus like this:
> > >
> > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> > >
> > > I am not a native English speaker, so I am not sure if the example is
> > > clear enough. I tried to use Jessie as a neutral name and "she" as
> > > disambiguation.
> > >
> > > With a big enough corpus, maybe you could create a model that outputs
> > > both classes, personMale and personFemale. To train a model you can follow
> > >
> > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > >
> > > Let's say your results are not good enough. You could increment your
> > > model using Custom Feature Generators (
> > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > and
> > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > ).
> > >
> > > One of the implemented feature generators can take a dictionary (
> > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > ).
> > > You can also implement another convenient FeatureGenerator, for instance
> > > one based on regex.
> > >
> > > Again, it is just a wild guess at how to implement it. I don't know if it
> > > would perform well. I was only thinking about how to implement a gender
> > > ML model that uses the surrounding context.
> > >
> > > Hope I could clarify.
> > >
> > > William
> > >
> > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> > >
> > > > Hi William,
> > > > OK, so you are talking about a kind of pipeline where we execute:
> > > >
> > > > 1. NER (personM for example)
> > > > 2. Regex (filter to reduce false positives)
> > > > 3. Plain dictionary (filter as above) ?
> > > >
> > > > Yes, we can split our model in two for M and F. It is not a big
> > > > problem; we have a database grouped by gender.
> > > >
> > > > I only have a doubt regarding the use of a dictionary. If we use a
> > > > dictionary to create the model, we could just use it to detect names
> > > > without using NER. No?
> > > >
> > > > 2016-06-29 0:10 GMT+02:00 William Colen <william.co...@gmail.com>:
> > > >
> > > > > Do you plan to use the surrounding context? If yes, maybe you could
> > > > > try to split NER into two categories: PersonM and PersonF. Just an
> > > > > idea; I have never read or tried anything like it. You would need a
> > > > > training corpus with these classes.
> > > > >
> > > > > You could add both the plain dictionary and the regex as NER
> > > > > features as well and check how it improves.
> > > > >
> > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> > > > >
> > > > > > Hello everybody,
> > > > > >
> > > > > > we built a NER model to find persons (name) inside our documents.
> > > > > > We are looking for the best approach to understand if the name is
> > > > > > male/female.
> > > > > >
> > > > > > Possible solutions:
> > > > > > - Plain dictionary?
> > > > > > - Regex to check the initial and/letters of the name?
> > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > >
> > > > > > Thanks
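
For readers following this thread, here is a minimal sketch of what "features from the context" (word prefix and suffix, previous word, next word) can look like in plain Java. It is not OpenNLP code and does not reproduce the default generators quoted earlier in the thread. The class ContextFeatureSketch, the method extract, the feature name strings, the <BOS>/<EOS> markers, and the three-character prefix/suffix limit are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration only, not part of OpenNLP: builds string features
 * for one token from the token itself and its immediate neighbours.
 */
final class ContextFeatureSketch {

    // Assumed limit on prefix/suffix length; chosen for the example only.
    private static final int MAX_AFFIX_LENGTH = 3;

    /** Returns features for tokens[index] using a window of one token on each side. */
    static List<String> extract(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        addTokenFeatures(features, "cur", tokens[index]);

        if (index > 0) {
            addTokenFeatures(features, "prev", tokens[index - 1]);
        } else {
            features.add("prev=<BOS>"); // no previous word at the start of the sentence
        }

        if (index < tokens.length - 1) {
            addTokenFeatures(features, "next", tokens[index + 1]);
        } else {
            features.add("next=<EOS>"); // no next word at the end of the sentence
        }
        return features;
    }

    /** Adds the token text plus its prefixes and suffixes of length 1..MAX_AFFIX_LENGTH. */
    private static void addTokenFeatures(List<String> out, String position, String token) {
        out.add(position + "=" + token);
        int longest = Math.min(MAX_AFFIX_LENGTH, token.length());
        for (int len = 1; len <= longest; len++) {
            out.add(position + ".prefix=" + token.substring(0, len));
            out.add(position + ".suffix=" + token.substring(token.length() - len));
        }
    }

    public static void main(String[] args) {
        // Whitespace-tokenized sentence taken from the corpus example in the thread.
        String[] sentence = {"Jessie", "Robson", "is", "retiring", ",", "she",
                             "was", "a", "board", "member", "for", "5", "years", "."};
        for (String feature : extract(sentence, 0)) {
            System.out.println(feature);
        }
    }
}
```

In a personMale/personFemale setup like the one discussed above, name suffixes or a nearby "she" are the kind of evidence a trained model could weigh. Whether such features actually help has to be checked by evaluation, as William suggests.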