Hi,

I would recommend a hybrid approach: first look the name up in a plain
dictionary, and run a classifier only when the lookup is inconclusive.

It's straightforward, and I think it would perform better than relying on
classification alone.

In the first step you use a dictionary of names, each with an attribute
specifying whether the name is male, female or both. If the name is
exclusively male or exclusively female, there is no need to go any further.

If the name fits both genders, or is a family name etc., a second step
is needed in which you extract features from the context (surrounding
words, etc.) and perform a classification task using any machine learning
algorithm.
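
To make the two steps concrete, here is a minimal Java sketch. Everything
in it is hypothetical and only for illustration: the class name
HybridGenderResolver, the Gender enum, the toy dictionary entries, and the
pronounFallback method, which is a crude stand-in for whatever trained
classifier you would actually plug in. It does not use any OpenNLP API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: dictionary lookup first, context classifier second.
final class HybridGenderResolver {

    // BOTH marks names used for either gender; UNKNOWN means "could not decide".
    enum Gender { MALE, FEMALE, BOTH, UNKNOWN }

    private final Map<String, Gender> dictionary;
    private final Function<List<String>, Gender> contextClassifier;

    HybridGenderResolver(Map<String, Gender> dictionary,
                         Function<List<String>, Gender> contextClassifier) {
        this.dictionary = dictionary;
        this.contextClassifier = contextClassifier;
    }

    // Step 1: trust the dictionary when the name is exclusively male or female.
    // Step 2: otherwise (ambiguous or missing name) ask the context classifier.
    Gender resolve(String name, List<String> contextTokens) {
        Gender fromDictionary = dictionary.get(name.toLowerCase(Locale.ROOT));
        if (fromDictionary == Gender.MALE || fromDictionary == Gender.FEMALE) {
            return fromDictionary;
        }
        return contextClassifier.apply(contextTokens);
    }

    // Placeholder for a real trained model: it only looks for gendered pronouns.
    static Gender pronounFallback(List<String> contextTokens) {
        for (String token : contextTokens) {
            String t = token.toLowerCase(Locale.ROOT);
            if (t.equals("she") || t.equals("her")) {
                return Gender.FEMALE;
            }
            if (t.equals("he") || t.equals("his")) {
                return Gender.MALE;
            }
        }
        return Gender.UNKNOWN;
    }

    public static void main(String[] args) {
        Map<String, Gender> dictionary = new HashMap<>();
        dictionary.put("pierre", Gender.MALE);  // toy entries, assumed
        dictionary.put("jessie", Gender.BOTH);

        HybridGenderResolver resolver =
                new HybridGenderResolver(dictionary, HybridGenderResolver::pronounFallback);

        // Dictionary decides; the context is never consulted.
        System.out.println(resolver.resolve("Pierre", Arrays.asList("will", "join", "the", "board")));
        // Ambiguous name; the fallback sees "she" in the context.
        System.out.println(resolver.resolve("Jessie", Arrays.asList("is", "retiring", ",", "she")));
    }
}
```

The point of the design is that the (expensive, error-prone) classifier is
only consulted for the names the dictionary cannot settle.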

Another way would be to use the dictionary information itself (whether the
name is male, female or both) as a feature when you perform the
classification.
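
As a rough illustration of this second variant, the sketch below builds a
feature list for one name occurrence: the dictionary attribute becomes a
"dict=..." feature next to the surrounding words. The class name
GenderFeatureExtractor, the feature string format, and the dictionary
contents are my own assumptions; OpenNLP ships its own
DictionaryFeatureGenerator, which this sketch does not use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the dictionary attribute is just one more feature.
final class GenderFeatureExtractor {

    // tokens:      the tokenized sentence
    // nameIndex:   position of the name token inside tokens
    // nameGenders: lower-cased name -> "male", "female" or "both" (assumed format)
    // window:      how many tokens to look at on each side of the name
    static List<String> extract(String[] tokens, int nameIndex,
                                Map<String, String> nameGenders, int window) {
        List<String> features = new ArrayList<>();

        // Dictionary attribute as a feature; names not listed become "unknown".
        String name = tokens[nameIndex].toLowerCase(Locale.ROOT);
        features.add("dict=" + nameGenders.getOrDefault(name, "unknown"));

        // Surrounding words, tagged with their offset from the name.
        for (int offset = -window; offset <= window; offset++) {
            int i = nameIndex + offset;
            if (offset == 0 || i < 0 || i >= tokens.length) {
                continue;
            }
            String sign = offset > 0 ? "+" : "";
            features.add("w" + sign + offset + "=" + tokens[i].toLowerCase(Locale.ROOT));
        }
        return features;
    }
}
```

The resulting strings can be fed to any classifier that accepts nominal
features; the model then learns how much weight to give the dictionary
versus the context.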

Best regards,

Mondher

On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <[email protected]>
wrote:

> Awesome! Thank you so much William!
>
> 2016-06-29 13:36 GMT+02:00 William Colen <[email protected]>:
>
> > To create a NER model OpenNLP extracts features from the context, things
> > such as: word prefix and suffix, next word, previous word, previous word
> > prefix and suffix, next word prefix and suffix etc.
> > When you don't configure the feature generator it will apply the default:
> >
> > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> >
> > Default feature generator:
> >
> > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> >          new AdaptiveFeatureGenerator[]{
> >            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> >            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> >            new OutcomePriorFeatureGenerator(),
> >            new PreviousMapFeatureGenerator(),
> >            new BigramNameFeatureGenerator(),
> >            new SentenceFeatureGenerator(true, false)
> >            });
> >
> >
> > These default features should work for most cases (especially English),
> > but they can of course be extended. If you do so, your model will take the
> > new features into account. So yes, you are putting the features in your
> > model.
> >
> > Configuring custom features is not easy. I would start with the default,
> > run 10-fold cross-validation and take note of its effectiveness. Then
> > change or add a feature, evaluate again and take notes. Sometimes a
> > feature that we are sure would help can destroy the model's effectiveness.
> >
> > Regards
> > William
> >
> >
> > 2016-06-29 7:00 GMT-03:00 Damiano Porta <[email protected]>:
> >
> > > Thank you William! Really appreciated!
> > >
> > > There is only one point I do not get. When you said "You could
> > > increment your model using Custom Feature Generators", does it mean
> > > that I can "put" these features inside ONE *.bin* file (model) that
> > > implements different things, or is the name finder one thing and those
> > > feature generators another?
> > >
> > > Thank you in advance for the clarification.
> > >
> > > 2016-06-29 1:23 GMT+02:00 William Colen <[email protected]>:
> > >
> > > > Not exactly. You would create a new NER model to replace yours.
> > > >
> > > > In this approach you would need a corpus like this:
> > > >
> > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> > > >
> > > >
> > > > I am not a native English speaker, so I am not sure if the example is
> > > > clear enough. I tried to use Jessie as a neutral name and "she" as
> > > > disambiguation.
> > > >
> > > > With a big enough corpus, maybe you could create a model that outputs
> > > > both classes, personMale and personFemale. To train a model you can
> > > > follow
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > >
> > > > Let's say your results are not good enough. You could increment your
> > > > model using Custom Feature Generators (
> > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > and
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > ).
> > > >
> > > > One of the implemented feature generators can take a dictionary (
> > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > ).
> > > > You can also implement other convenient feature generators, for
> > > > instance one based on regex.
> > > >
> > > > Again, it is just a wild guess at how to implement it. I don't know
> > > > if it would perform well. I was only thinking about how to implement a
> > > > gender ML model that uses the surrounding context.
> > > >
> > > > I hope this clarifies things.
> > > >
> > > > William
> > > >
> > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <[email protected]>:
> > > >
> > > > > Hi William,
> > > > > Ok, so you are talking about a kind of pipeline where we execute:
> > > > >
> > > > > 1. NER (personM for example)
> > > > > 2. Regex (filter to reduce false positives)
> > > > > 3. Plain dictionary (filter as above) ?
> > > > >
> > > > > Yes, we can split our model in two for M and F. It is not a big
> > > > > problem, we have a database grouped by gender.
> > > > >
> > > > > I only have a doubt regarding the use of a dictionary. If we use a
> > > > > dictionary to create the model, couldn't we just use the dictionary
> > > > > to detect names, without NER?
> > > > >
> > > > >
> > > > >
> > > > > 2016-06-29 0:10 GMT+02:00 William Colen <[email protected]>:
> > > > >
> > > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > > > > could try to split NER into two categories: PersonM and PersonF.
> > > > > > Just an idea, I have never read or tried anything like it. You
> > > > > > would need a training corpus with these classes.
> > > > > >
> > > > > > You could also add both the plain dictionary and the regex as NER
> > > > > > features and check how much it improves.
> > > > > >
> > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <[email protected]>:
> > > > > >
> > > > > > > Hello everybody,
> > > > > > >
> > > > > > > we built a NER model to find persons (names) inside our
> > > > > > > documents. We are looking for the best approach to determine
> > > > > > > whether a name is male or female.
> > > > > > >
> > > > > > > Possible solutions:
> > > > > > > - Plain dictionary?
> > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
