Re: Model to detect the gender

Damiano Porta Mon, 04 Jul 2016 06:07:06 -0700

Jörn, could you please send me a link to that model?

2016-07-04 14:42 GMT+02:00 Joern Kottmann <[email protected]>:


> The co-referencer we used to have in opennlp-tools has a model to
> detect the gender of names. That could be extracted and put into a
> standalone component.
>
> Jörn
>
> On Mon, Jul 4, 2016 at 2:41 PM, Joern Kottmann <[email protected]> wrote:
>
> > I was speaking about the second case. We could build a dedicated
> > component specialized in extracting properties about already detected
> > entities.
> >
> > Jörn
> >
> > On Mon, Jul 4, 2016 at 2:33 PM, Damiano Porta <[email protected]>
> > wrote:
> >
> >> Hello Jorn,
> >> Do you mean that I need to "extend" my NER model to find other
> >> name-related entities too?
> >>
> >> OR
> >>
> >> Find the entities with a dictionary and then train a maxent model that
> >> finds other properties like person title, job position etc?
> >>
> >> Thanks for the clarification.
> >>
> >>
> >> 2016-07-04 12:15 GMT+02:00 Joern Kottmann <[email protected]>:
> >>
> >> > Hello,
> >> >
> >> > there are also other interesting properties e.g. person title (e.g.
> >> > professor, doctor), job title/position,
> >> > company legal form. And much more for other entity types.
> >> >
> >> > Maybe it would be worth it to build a dedicated component to extract
> >> > properties from entities.
> >> >
> >> > Jörn
> >> >
> >> > On Fri, Jul 1, 2016 at 3:05 PM, Mondher Bouazizi <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Sorry for my late reply. I didn't quite understand your last email,
> >> > > but here is what I meant:
> >> > >
> >> > > Given a simple dictionary with the following columns:
> >> > >
> >> > > Name      Type      Gender
> >> > > Agatha    First     F
> >> > > John      First     M
> >> > > Smith     Both      B
> >> > >
> >> > > where:
> >> > > - "First" refers to a first name, "Last" (not in the example) refers
> >> > >   to a last name, and "Both" means it can be either.
> >> > > - "F" refers to female, "M" refers to male, and "B" refers to both
> >> > >   genders.
> >> > >
> >> > > and given the following two sentences:
> >> > >
> >> > > 1. "It was nice meeting you John. I hope we meet again soon."
> >> > >
> >> > > 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case
> >> > > and felt she knows something"
> >> > >
> >> > > In the first example, when you check the dictionary, the name "John"
> >> > > is a male name, so there is no need to go any further.
> >> > > However, in the second example, the name "Smith", which is a family
> >> > > name in our case, can fit both males and females. Therefore, we need
> >> > > to extract features from the surrounding context and perform a
> >> > > classification task.
> >> > > Here are some of the features I think would be interesting to use:
> >> > >
> >> > > . F1: Presence of a male honorific (e.g. "Mr.") before the name
> >> > >   Values={True, False}
> >> > > . F2: Presence of a female honorific (e.g. "Mrs.") before the name
> >> > >   Values={True, False}
> >> > >
> >> > > . F3: Gender of the first personal pronoun (subject or object form)
> >> > >   to the right of the name    Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F4: Distance between the name and the first personal pronoun to
> >> > >   the right (in words)    Values=NUMERIC
> >> > > . F5: Gender of the second personal pronoun to the right of the name
> >> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F6: Distance between the name and the second personal pronoun to
> >> > >   the right (in words)    Values=NUMERIC
> >> > > . F7: Gender of the third personal pronoun to the right of the name
> >> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F8: Distance between the name and the third personal pronoun to
> >> > >   the right (in words)    Values=NUMERIC
> >> > >
> >> > > . F9: Gender of the first personal pronoun (subject or object form)
> >> > >   to the left of the name    Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F10: Distance between the name and the first personal pronoun to
> >> > >   the left (in words)    Values=NUMERIC
> >> > > . F11: Gender of the second personal pronoun to the left of the name
> >> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F12: Distance between the name and the second personal pronoun to
> >> > >   the left (in words)    Values=NUMERIC
> >> > > . F13: Gender of the third personal pronoun to the left of the name
> >> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> >> > > . F14: Distance between the name and the third personal pronoun to
> >> > >   the left (in words)    Values=NUMERIC
> >> > >
> >> > > In the second example, here are the values you get for these features:
> >> > >
> >> > > F1 = False
> >> > > F2 = True
> >> > > F3 = UNCERTAIN
> >> > > F4 = 1
> >> > > F5 = FEMALE
> >> > > F6 = 3
> >> > > F7 = FEMALE
> >> > > F8 = 4
> >> > > F9 = UNCERTAIN
> >> > > F10 = 2
> >> > > F11 = EMPTY
> >> > > F12 = 0
> >> > > F13 = EMPTY
> >> > > F14 = 0
> >> > >
> >> > > Of course the choice of features depends on the type of data, and
> >> > > the features themselves might not work well for some texts, such as
> >> > > ones collected from Twitter.
> >> > >
> >> > > I hope this helps.
> >> > >
> >> > > Best regards
> >> > >
> >> > > Mondher
> >> > >
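For reference, the feature list above could be computed roughly as follows. This is only a sketch in plain Java, not OpenNLP code: the class name GenderContextFeatures, the small word lists, the whitespace tokenization and the rule that an honorific counts as part of the name (so left distances are measured from the honorific) are all assumptions made for illustration. With those assumptions it reproduces the F1..F14 values listed for the "Mrs. Smith" example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an OpenNLP class: computes the features F1..F14
// for one name occurrence in a whitespace-tokenized text.
class GenderContextFeatures {

    // Illustrative word lists; a real system would need fuller ones.
    static final Set<String> MALE_HONORIFICS = new HashSet<>(Arrays.asList("mr", "sir"));
    static final Set<String> FEMALE_HONORIFICS = new HashSet<>(Arrays.asList("mrs", "ms", "miss"));
    static final Set<String> MALE_PRONOUNS = new HashSet<>(Arrays.asList("he", "him"));
    static final Set<String> FEMALE_PRONOUNS = new HashSet<>(Arrays.asList("she", "her"));
    static final Set<String> OTHER_PRONOUNS =
            new HashSet<>(Arrays.asList("i", "me", "you", "we", "us", "they", "them", "it"));

    // Lowercase a token and drop punctuation, e.g. "Mrs." -> "mrs".
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    // Returns MALE, FEMALE or UNCERTAIN for a personal pronoun, null otherwise.
    static String pronounGender(String token) {
        String t = normalize(token);
        if (MALE_PRONOUNS.contains(t)) return "MALE";
        if (FEMALE_PRONOUNS.contains(t)) return "FEMALE";
        if (OTHER_PRONOUNS.contains(t)) return "UNCERTAIN";
        return null;
    }

    // Scans from 'from' in direction 'step' and appends (gender, distance)
    // for the first three pronouns; missing pronouns become (EMPTY, 0).
    static void scan(String[] tokens, int from, int step, int origin, List<String> out) {
        int found = 0;
        for (int i = from; i >= 0 && i < tokens.length && found < 3; i += step) {
            String gender = pronounGender(tokens[i]);
            if (gender != null) {
                out.add(gender);
                out.add(String.valueOf(Math.abs(i - origin)));
                found++;
            }
        }
        for (; found < 3; found++) {
            out.add("EMPTY");
            out.add("0");
        }
    }

    static List<String> extract(String[] tokens, int nameIndex) {
        String before = nameIndex > 0 ? normalize(tokens[nameIndex - 1]) : "";
        boolean male = MALE_HONORIFICS.contains(before);
        boolean female = FEMALE_HONORIFICS.contains(before);
        // Assumption: an honorific counts as part of the name span, so
        // distances to the left are measured from the honorific.
        int spanStart = (male || female) ? nameIndex - 1 : nameIndex;

        List<String> features = new ArrayList<>();
        features.add(male ? "True" : "False");                 // F1
        features.add(female ? "True" : "False");               // F2
        scan(tokens, nameIndex + 1, 1, nameIndex, features);   // F3..F8
        scan(tokens, spanStart - 1, -1, spanStart, features);  // F9..F14
        return features;
    }

    public static void main(String[] args) {
        String text = "Yes, I met Mrs. Smith. I asked her her opinion about the case"
                + " and felt she knows something";
        String[] tokens = text.split("\\s+");
        System.out.println(extract(tokens, 4)); // index 4 is "Smith."
    }
}
```

The resulting list of 14 values is what would be handed, together with the known gender, to whatever classifier is trained in the second step.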
> >> > >
> >> > > On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hi Mondher,
> >> > > > could you give me a raw example to understand how I should train
> >> > > > the classifier model?
> >> > > >
> >> > > > Thank you in advance!
> >> > > > Damiano
> >> > > >
> >> > > >
> >> > > > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <[email protected]>:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > I would recommend a hybrid approach where, in a first step, you
> >> > > > > use a plain dictionary and then perform the classification if
> >> > > > > needed.
> >> > > > >
> >> > > > > It's straightforward, but I think it would give better
> >> > > > > performance than just performing a classification task.
> >> > > > >
> >> > > > > In the first step you use a dictionary of names along with an
> >> > > > > attribute specifying whether the name fits males, females or
> >> > > > > both. If the name fits males or females exclusively, there is
> >> > > > > no need to go any further.
> >> > > > >
> >> > > > > If the name fits both genders, or is a family name etc., a
> >> > > > > second step is needed where you extract features from the
> >> > > > > context (surrounding words, etc.) and perform a classification
> >> > > > > task using any machine learning algorithm.
> >> > > > >
> >> > > > > Another way would be to use the information itself (whether the
> >> > > > > name fits males, females or both) as a feature when you perform
> >> > > > > the classification.
> >> > > > >
> >> > > > > Best regards,
> >> > > > >
> >> > > > > Mondher
> >> > > > >
> >> > > > > I am not sure
> >> > > > >
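The two-step flow described above could look roughly like this. It is a minimal sketch under stated assumptions: HybridGenderResolver, its Gender enum and the toy classifier are hypothetical names made up for the example (none of them is an OpenNLP API), and the toy classifier only stands in for a real trained model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the dictionary-first, classifier-second approach.
class HybridGenderResolver {

    enum Gender { MALE, FEMALE, UNKNOWN }

    // Step 1 data: name -> 'M', 'F' or 'B' (fits both genders).
    private final Map<String, Character> dictionary = new HashMap<>();
    // Step 2: any classifier that maps the surrounding context to a gender.
    private final Function<String, Gender> contextClassifier;

    HybridGenderResolver(Function<String, Gender> contextClassifier) {
        this.contextClassifier = contextClassifier;
    }

    void addName(String name, char code) {
        dictionary.put(name.toLowerCase(Locale.ROOT), code);
    }

    Gender resolve(String name, String context) {
        Character code = dictionary.get(name.toLowerCase(Locale.ROOT));
        if (code != null && code == 'M') return Gender.MALE;
        if (code != null && code == 'F') return Gender.FEMALE;
        // Ambiguous ('B') or unknown name: fall back to the context classifier.
        return contextClassifier.apply(context);
    }

    // Toy stand-in for a trained model: returns the gender of the first
    // gendered pronoun it sees, or UNKNOWN if there is none.
    static Gender toyClassifier(String context) {
        for (String word : context.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.equals("she") || word.equals("her")) return Gender.FEMALE;
            if (word.equals("he") || word.equals("him")) return Gender.MALE;
        }
        return Gender.UNKNOWN;
    }

    public static void main(String[] args) {
        HybridGenderResolver resolver = new HybridGenderResolver(HybridGenderResolver::toyClassifier);
        resolver.addName("John", 'M');
        resolver.addName("Smith", 'B');
        System.out.println(resolver.resolve("John", "It was nice meeting you John."));
        System.out.println(resolver.resolve("Smith", "I asked her her opinion about the case"));
    }
}
```

The point of the design is that the classifier is only consulted when the dictionary cannot decide, so a cheap lookup settles the easy cases.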
> >> > > > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <[email protected]>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Awesome! Thank you so much William!
> >> > > > > >
> >> > > > > > 2016-06-29 13:36 GMT+02:00 William Colen <[email protected]>:
> >> > > > > >
> >> > > > > > > To create a NER model OpenNLP extracts features from the
> >> > > > > > > context, things such as: word prefix and suffix, next word,
> >> > > > > > > previous word, previous word prefix and suffix, next word
> >> > > > > > > prefix and suffix etc.
> >> > > > > > > When you don't configure the feature generator it will apply
> >> > > > > > > the default:
> >> > > > > > >
> >> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> >> > > > > > >
> >> > > > > > > Default feature generator:
> >> > > > > > >
> >> > > > > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> >> > > > > > >          new AdaptiveFeatureGenerator[]{
> >> > > > > > >            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> >> > > > > > >            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> >> > > > > > >            new OutcomePriorFeatureGenerator(),
> >> > > > > > >            new PreviousMapFeatureGenerator(),
> >> > > > > > >            new BigramNameFeatureGenerator(),
> >> > > > > > >            new SentenceFeatureGenerator(true, false)
> >> > > > > > >            });
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > These default features should work for most cases (especially
> >> > > > > > > English), but they can of course be extended. If you do so,
> >> > > > > > > your model will take the new features into account. So yes,
> >> > > > > > > you are putting the features in your model.
> >> > > > > > >
> >> > > > > > > Configuring custom features is not easy. I would start with
> >> > > > > > > the default, use 10-fold cross-validation and take notes of
> >> > > > > > > its effectiveness. Then change/add a feature, evaluate and
> >> > > > > > > take notes. Sometimes a feature that we are sure would help
> >> > > > > > > can destroy the model's effectiveness.
> >> > > > > > >
> >> > > > > > > Regards
> >> > > > > > > William
> >> > > > > > >
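To make the 10-fold cross-validation step concrete, here is a small stand-alone sketch of the partitioning only. KFoldSplit and its methods are hypothetical names for this example; OpenNLP ships its own cross-validation tooling for the name finder, which is what you would normally use, and the training and evaluation calls are deliberately left as a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating k-fold partitioning; not an OpenNLP class.
class KFoldSplit {

    // Samples whose index i satisfies i % k == fold form the test set.
    static <T> List<T> testFold(List<T> samples, int k, int fold) {
        List<T> test = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % k == fold) test.add(samples.get(i));
        }
        return test;
    }

    // All remaining samples form the training set for that fold.
    static <T> List<T> trainFolds(List<T> samples, int k, int fold) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % k != fold) train.add(samples.get(i));
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>();
        for (int i = 0; i < 25; i++) sentences.add("sentence-" + i);

        int k = 10;
        for (int fold = 0; fold < k; fold++) {
            List<String> train = trainFolds(sentences, k, fold);
            List<String> test = testFold(sentences, k, fold);
            // Train on 'train', evaluate on 'test', and note the scores here.
            System.out.println("fold " + fold + ": train=" + train.size() + " test=" + test.size());
        }
    }
}
```

Running the same k folds before and after each feature change keeps the recorded scores comparable.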
> >> > > > > > >
> >> > > > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <[email protected]>:
> >> > > > > > >
> >> > > > > > > > Thank you William! Really appreciated!
> >> > > > > > > >
> >> > > > > > > > There is only one point I do not get. When you said "You could
> >> > > > > > > > increment your model using Custom Feature Generators", does it
> >> > > > > > > > mean that I can "put" these features inside ONE *.bin* file
> >> > > > > > > > (model) that implements different things, or is the name
> >> > > > > > > > finder one thing and those feature generators another?
> >> > > > > > > >
> >> > > > > > > > Thank you in advance for the clarification.
> >> > > > > > > >
> >> > > > > > > > 2016-06-29 1:23 GMT+02:00 William Colen <[email protected]>:
> >> > > > > > > >
> >> > > > > > > > > Not exactly. You would create a new NER model to replace yours.
> >> > > > > > > > >
> >> > > > > > > > > In this approach you would need a corpus like this:
> >> > > > > > > > >
> >> > > > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> >> > > > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> >> > > > > > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I am not a native English speaker, so I am not sure if the
> >> > > > > > > > > example is clear enough. I tried to use Jessie as a neutral
> >> > > > > > > > > name and "she" as disambiguation.
> >> > > > > > > > >
> >> > > > > > > > > With a big enough corpus maybe you could create a model that
> >> > > > > > > > > outputs both classes, personMale and personFemale. To train a
> >> > > > > > > > > model you can follow
> >> > > > > > > > >
> >> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> >> > > > > > > > >
> >> > > > > > > > > Let's say your results are not good enough. You could
> >> > > > > > > > > increment your model using Custom Feature Generators (
> >> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> >> > > > > > > > > and
> >> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> >> > > > > > > > > ).
> >> > > > > > > > >
> >> > > > > > > > > One of the implemented feature generators can take a
> >> > > > > > > > > dictionary (
> >> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> >> > > > > > > > > ).
> >> > > > > > > > > You can also implement another convenient FeatureGenerator,
> >> > > > > > > > > for instance one based on regex.
> >> > > > > > > > >
> >> > > > > > > > > Again, it is just a wild guess of how to implement it. I
> >> > > > > > > > > don't know if it would perform well. I was only thinking
> >> > > > > > > > > about how to implement a gender ML model that uses the
> >> > > > > > > > > surrounding context.
> >> > > > > > > > >
> >> > > > > > > > > Hope I could clarify.
> >> > > > > > > > >
> >> > > > > > > > > William
> >> > > > > > > > >
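Before training on a corpus annotated with personMale and personFemale as in the example above, it can help to check how many spans of each class it contains. The following is a rough sketch that only assumes the `<START:type> ... <END>` markup shown in the example; AnnotatedCorpusStats is a hypothetical helper written for illustration and does not use any OpenNLP classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts annotated spans per entity type in a corpus
// that uses the <START:type> ... <END> markup.
class AnnotatedCorpusStats {

    // group(1) is the entity type, group(2) the annotated tokens.
    private static final Pattern SPAN =
            Pattern.compile("<START:(\\w+)>\\s+(.*?)\\s+<END>");

    static Map<String, Integer> countByType(String corpus) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = SPAN.matcher(corpus);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String corpus =
                "<START:personMale> Pierre Vinken <END> , 61 years old , will join the board"
                + " as a nonexecutive director Nov. 29 .\n"
                + "Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. ,"
                + " the Dutch publishing group .\n"
                + "<START:personFemale> Jessie Robson <END> is retiring ,"
                + " she was a board member for 5 years .";
        System.out.println(countByType(corpus));
    }
}
```

A strongly skewed count between the two classes would be one thing to note before judging the model's results.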
> >> > > > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <[email protected]>:
> >> > > > > > > > >
> >> > > > > > > > > > Hi William,
> >> > > > > > > > > > Ok, so you are talking about a kind of pipe where we
> >> > execute:
> >> > > > > > > > > >
> >> > > > > > > > > > 1. NER (personM for example)
> >> > > > > > > > > > 2. Regex (filter to reduce false positives)
> >> > > > > > > > > > 3. Plain dictionary (filter as above) ?
> >> > > > > > > > > >
> >> > > > > > > > > > Yes, we can split our model in two for M and F. It is not a
> >> > > > > > > > > > big problem, we have a database grouped by gender.
> >> > > > > > > > > >
> >> > > > > > > > > > I only have a doubt regarding the use of a dictionary.
> >> > > > > > > > > > Because if we use a dictionary to create the model, we could
> >> > > > > > > > > > only use it to detect names without using NER. No?
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <[email protected]>:
> >> > > > > > > > > >
> >> > > > > > > > > > > Do you plan to use the surrounding context? If yes, maybe
> >> > > > > > > > > > > you could try to split NER in two categories: PersonM and
> >> > > > > > > > > > > PersonF. Just an idea, never read or tried anything like
> >> > > > > > > > > > > it. You would need a training corpus with these classes.
> >> > > > > > > > > > >
> >> > > > > > > > > > > You could add both the plain dictionary and the regex as
> >> > > > > > > > > > > NER features as well and check how it improves.
> >> > > > > > > > > > >
> >> > > > > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <[email protected]>:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hello everybody,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > we built a NER model to find persons (name) inside our
> >> > > > > > > > > > > > documents.
> >> > > > > > > > > > > > We are looking for the best approach to understand if the
> >> > > > > > > > > > > > name is male/female.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Possible solutions:
> >> > > > > > > > > > > > - Plain dictionary?
> >> > > > > > > > > > > > - Regex to check the initial and/letters of the name?
> >> > > > > > > > > > > > - Classifier? (naive bayes? Maxent?)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
