Re: Model to detect the gender

2016-06-29 Thread Mondher Bouazizi
Hi,

I would recommend a hybrid approach where, in a first step, you use a plain
dictionary and then perform the classification if needed.

It is straightforward, but I think it would give better performance than
performing a classification task alone.

In the first step, you use a dictionary of names along with an attribute
specifying whether each name is used for males, females, or both. If the
name is used exclusively for one gender, there is no need to go any further.

If the name is used for both genders, or is a family name, etc., a second
step is needed: extract features from the context (surrounding words, etc.)
and perform a classification task using any machine learning algorithm.

Another way would be to use that information itself (whether the name is
used for males, females, or both) as a feature when you perform the
classification.
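
A minimal Java sketch of this two-step approach (the dictionary entries and
the "contextClassifier" are hypothetical placeholders; in practice the
dictionary would be loaded from your name database and the categorizer
trained on context windows around ambiguous names):

import java.util.HashMap;
import java.util.Map;

import opennlp.tools.doccat.DocumentCategorizerME;

public class HybridGenderResolver {

    private final Map<String, String> nameDict = new HashMap<>();
    private final DocumentCategorizerME contextClassifier;

    public HybridGenderResolver(DocumentCategorizerME contextClassifier) {
        this.contextClassifier = contextClassifier;
        nameDict.put("john", "male");   // placeholder entries
        nameDict.put("mary", "female");
        nameDict.put("alex", "both");   // ambiguous: needs the second step
    }

    public String resolve(String name, String[] contextTokens) {
        String entry = nameDict.get(name.toLowerCase());
        if ("male".equals(entry) || "female".equals(entry)) {
            return entry;               // step 1: the dictionary is decisive
        }
        // step 2: classify using the surrounding words as features
        double[] outcomes = contextClassifier.categorize(contextTokens);
        return contextClassifier.getBestCategory(outcomes);
    }
}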

Best regards,

Mondher


On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta 
wrote:

> Awesome! Thank you so much William!
>
> 2016-06-29 13:36 GMT+02:00 William Colen :
>
> > To create a NER model OpenNLP extracts features from the context, things
> > such as: word prefix and suffix, next word, previous word, previous word
> > prefix and suffix, next word prefix and suffix etc.
> > When you don't configure the feature generator it will apply the default:
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> >
> > Default feature generator:
> >
> > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> >     new AdaptiveFeatureGenerator[]{
> >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> >         new OutcomePriorFeatureGenerator(),
> >         new PreviousMapFeatureGenerator(),
> >         new BigramNameFeatureGenerator(),
> >         new SentenceFeatureGenerator(true, false)
> >     });
> >
> >
> > These default features should work for most cases (especially English),
> > but they can of course be extended. If you do so, your model will take
> > the new features into account. So yes, you are putting the features in
> > your model.
> >
> > Configuring custom features is not easy. I would start with the default,
> > use 10-fold cross-validation, and take notes on its effectiveness. Then
> > change or add a feature, evaluate, and take notes again. Sometimes a
> > feature that we are sure would help can destroy the model's effectiveness.
> >
> > Regards
> > William
> >
> >
> > 2016-06-29 7:00 GMT-03:00 Damiano Porta :
> >
> > > Thank you William! Really appreciated!
> > >
> > > There is just one point I do not get: when you said "You could
> > > increment your model using Custom Feature Generators", does it mean
> > > that I can "put" these features inside ONE *.bin* file (model) that
> > > implements different things, or is the name finder one thing and the
> > > feature generators another?
> > >
> > > Thank you in advance for the clarification.
> > >
> > > 2016-06-29 1:23 GMT+02:00 William Colen :
> > >
> > > > Not exactly. You would create a new NER model to replace yours.
> > > >
> > > > In this approach you would need a corpus like this:
> > > >
> > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > > > board as a nonexecutive director Nov. 29 .
> > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > > > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > > > retiring , she was a board member for 5 years .
> > > >
> > > >
> > > > I am not a native English speaker, so I am not sure if the example is
> > > > clear enough. I tried to use Jessie as a gender-neutral name and "she"
> > > > as the disambiguation.
> > > >
> > > > With a corpus big enough maybe you could create a model that outputs
> > both
> > > > classes, personMale and personFemale. To train a model you can follow
> > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > >
> > > > Let's say your results are not good enough. You could increment your
> > > model
> > > > using Custom Feature Generators (
> > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > and
> > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > ).
> > > >
> > > > One of the implemented featuregen can take a dictionary (
> > > >
> > > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > ).
> > > > You can also implement other convenient FeatureGenerator, for
> instance
> > > > regex.
> > > >
> > > > Again, it is just a wild guess of how to implement 

Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Boris Galitsky
Hi Anthony


  My interest lies in the question you raised: how to machine-learn the
structure of a paragraph (not a whole document yet), given the parse trees of
its individual sentences.


Doc2vec is one direction, but my personal preference is more explicit and
structure-based. In my opinion, the deep learning family of approaches
leverages the huge training datasets it learns from, but lacks a
representation of the logical structure of a given document. On the other
hand, a discourse tree of a paragraph is a good way to link individual parse
trees into a structure that represents a paragraph of text, but it lacks the
extensive knowledge of how n-grams form "meanings" in documents. Therefore I
believe doc2vec and the learning of discourse trees complement each other.


To systematically learn discourse trees in addition to parse trees, we use
tree kernel learning. It forms the space of all sub-trees of the trees with
abstract labels and does SVM learning in that space. We combine regular parse
trees with links between sentences such as rhetorical relations.
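
To make the idea concrete, here is a toy sketch of the classic subset-tree
kernel recursion (in the style of Collins and Duffy), which counts the
sub-trees two trees share. The Node type is made up for illustration; a real
implementation adds a decay factor and plugs the kernel into SVM training:

import java.util.ArrayList;
import java.util.List;

class Node {
    final String label;
    final List<Node> children = new ArrayList<>();
    Node(String label) { this.label = label; }
}

class TreeKernel {
    // Number of common sub-trees rooted at nodes a and b.
    static int common(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.size() != b.children.size()) {
            return 0;
        }
        // The productions must match child by child.
        for (int i = 0; i < a.children.size(); i++) {
            if (!a.children.get(i).label.equals(b.children.get(i).label)) {
                return 0;
            }
        }
        int score = 1;
        for (int i = 0; i < a.children.size(); i++) {
            score *= 1 + common(a.children.get(i), b.children.get(i));
        }
        return score;
    }
}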


The application areas are:

- answering multi-sentence questions

- document-level classification and text style recognition, e.g. for the
security domain, where documents include the same words but need to be
classified by style

- content generation, where maintaining rhetorical structure is important.


>The generalizations could hurt the classification performance in some
>tasks, but seem to be more useful when the target documents are larger.


Yes, in this case discourse trees are less plausible.


>It could also be possible to choose the "document" to be a single word as
>well, reducing the underlying matrix to an array; does that make sense?


I have not thought about that, and I have not tried it either.


>Therefore, we could also use document-based vectors for mid- to high-layer
>tasks (doc cat, sentiment, profile, etc.). What do you think?


I think for document classification, yes; for sentiment I am more skeptical,
although it is the evaluation area of Mikolov et al.




Do you have a particular problem in mind?


I can share code on git / papers on the above.


Another way to look at deep learning for NLP: deep learning kind of takes the
science away from linguistics and makes it more like engineering; I am not
sure that is a direction for OpenNLP.


Regards

Boris


From: Anthony Beylerian 
Sent: Wednesday, June 29, 2016 11:24:02 AM
To: dev@opennlp.apache.org
Subject: Re: DeepLearning4J as a ML for OpenNLP

Hi Boris,

Thank you very much for sharing your experience with us!
Is it possible to ask you for more information?

I have only recently used dl4j with some introductory material, but I have
also felt doc2vec could be quite useful, although my understanding of it is
still limited.

My current understanding is that doc2vec, as an extension of word2vec, can
capture a more generalized context (the document) instead of just focusing
on the context of a single word, in order to provide features useful for
classifying that document.

The advantage would be to better capture latent information that exists in
the document (such as the order of words), instead of just averaging word
vectors or using other document-level approaches (I would love some feedback
on this).

The generalizations could hurt the classification performance in some
tasks, but seem to be more useful when the target documents are larger.

It could also be possible to choose the "document" to be a single word as
well, reducing the underlying matrix to an array; does that make sense?

Therefore, we could also use document-based vectors for mid- to high-layer
tasks (doc cat, sentiment, profile, etc.). What do you think?

It would be fantastic to clarify this; I believe that would also motivate
more people to pitch in and better assist with this.

Thanks,

Anthony
Hi William


I have never heard of Features2Vec.

I think that for low-level, pre-linguistic tasks such as text
classification, where we don't want to build models and want a
one-size-fits-all solution, Word2Vec works well. I used it in an industrial
environment for text classification, some information extraction, and
content generation tasks. So I think it should also work for low-level
OpenNLP tasks.


Regards

Boris



From: William Colen 
Sent: Wednesday, June 29, 2016 4:43:25 AM
To: dev@opennlp.apache.org
Subject: Re: DeepLearning4J as a ML for OpenNLP

Thank you, Boris. I am new to deep learning, so I have no idea what issues
we would face. I was wondering if we could use Features2Vec instead of
Word2Vec; does that make any sense?
The idea was to use DL in low-level NLP tasks where we don't have parse
trees yet.


2016-06-29 6:34 GMT-03:00 Boris Galitsky :

> Hi guys
>
>   I should mention how we used DeepLearning4J for the OpenNLP.Similarity
> project at
>
> https://github.com/bgalitsky/relevance-based-on-parse-trees
>
>
> The main 

Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
Hi Boris,

Thank you very much for sharing your experience with us!
Is it possible to ask you for more information?

I have only recently used dl4j with some introductory material, but I have
also felt doc2vec could be quite useful, although my understanding of it is
still limited.

My current understanding is that doc2vec, as an extension of word2vec, can
capture a more generalized context (the document) instead of just focusing
on the context of a single word, in order to provide features useful for
classifying that document.

The advantage would be to better capture latent information that exists in
the document (such as the order of words), instead of just averaging word
vectors or using other document-level approaches (I would love some feedback
on this).
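
As a concrete starting point, a minimal doc2vec run with DL4J might look
roughly like this. The class and builder names reflect my reading of the
current DL4J API, so treat them as assumptions, and the "labeledDocs" folder
(whose subdirectories name the document labels) is a made-up example:

import java.io.File;

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Doc2VecSketch {
    public static void main(String[] args) {
        // each subdirectory of labeledDocs/ becomes a document label
        FileLabelAwareIterator iterator = new FileLabelAwareIterator.Builder()
                .addSourceFolder(new File("labeledDocs"))
                .build();
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .layerSize(100)   // dimensionality of the document vectors
                .epochs(5)
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .build();
        vectors.fit();

        // the learned label vectors can then feed a downstream classifier
        System.out.println(vectors.similarity("label_A", "label_B"));
    }
}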

The generalizations could hurt the classification performance in some
tasks, but seem to be more useful when the target documents are larger.

It could also be possible to choose the "document" to be a single word as
well, reducing the underlying matrix to an array; does that make sense?

Therefore, we could also use document-based vectors for mid- to high-layer
tasks (doc cat, sentiment, profile, etc.). What do you think?

It would be fantastic to clarify this; I believe that would also motivate
more people to pitch in and better assist with this.

Thanks,

Anthony
Hi William


I have never heard of Features2Vec.

I think that for low-level, pre-linguistic tasks such as text
classification, where we don't want to build models and want a
one-size-fits-all solution, Word2Vec works well. I used it in an industrial
environment for text classification, some information extraction, and
content generation tasks. So I think it should also work for low-level
OpenNLP tasks.


Regards

Boris



From: William Colen 
Sent: Wednesday, June 29, 2016 4:43:25 AM
To: dev@opennlp.apache.org
Subject: Re: DeepLearning4J as a ML for OpenNLP

Thank you, Boris. I am new to deep learning, so I have no idea what issues
we would face. I was wondering if we could use Features2Vec instead of
Word2Vec; does that make any sense?
The idea was to use DL in low-level NLP tasks where we don't have parse
trees yet.


2016-06-29 6:34 GMT-03:00 Boris Galitsky :

> Hi guys
>
>   I should mention how we used DeepLearning4J for the OpenNLP.Similarity
> project at
>
> https://github.com/bgalitsky/relevance-based-on-parse-trees
>
>
> The main question is how word2vec models and linguistic information such
> as parse trees complement each other. In a word2vec approach any two words
> can be compared. The weakness here is that computing a distance between
> totally unrelated words like 'cat' and 'fly' can be meaningless and
> uninformative, and can corrupt a learning model.
>
>
> In the OpenNLP.Similarity component, similarity is defined in terms of
> parse trees. When word2vec is applied on top of parse trees and not as a
> bag-of-words, we only compute the distance between words with the same
> semantic role, so the model becomes more accurate.
>
>
> There's a paper on the way which assesses the relevance improvement of
>
>
> word2vec (bag-of-words) [traditional] vs word2vec (parse-trees)
>
>
> Regards
>
> Boris
>
>
>
>
>
> 
> From: Anthony Beylerian 
> Sent: Wednesday, June 29, 2016 2:13:38 AM
> To: dev@opennlp.apache.org
> Subject: Re: DeepLearning4J as a ML for OpenNLP
>
> +1 would be willing to help out when possible
>


Re: Model to detect the gender

2016-06-29 Thread Damiano Porta
Awesome! Thank you so much William!

2016-06-29 13:36 GMT+02:00 William Colen :

> To create a NER model OpenNLP extracts features from the context, things
> such as: word prefix and suffix, next word, previous word, previous word
> prefix and suffix, next word prefix and suffix etc.
> When you don't configure the feature generator it will apply the default:
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
>
> Default feature generator:
>
> AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
>     new AdaptiveFeatureGenerator[]{
>         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
>         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
>         new OutcomePriorFeatureGenerator(),
>         new PreviousMapFeatureGenerator(),
>         new BigramNameFeatureGenerator(),
>         new SentenceFeatureGenerator(true, false)
>     });
>
>
> These default features should work for most cases (especially English),
> but they can of course be extended. If you do so, your model will take the
> new features into account. So yes, you are putting the features in your
> model.
>
> Configuring custom features is not easy. I would start with the default,
> use 10-fold cross-validation, and take notes on its effectiveness. Then
> change or add a feature, evaluate, and take notes again. Sometimes a
> feature that we are sure would help can destroy the model's effectiveness.
>
> Regards
> William
>
>
> 2016-06-29 7:00 GMT-03:00 Damiano Porta :
>
> > Thank you William! Really appreciated!
> >
> > There is just one point I do not get: when you said "You could
> > increment your model using Custom Feature Generators", does it mean that
> > I can "put" these features inside ONE *.bin* file (model) that implements
> > different things, or is the name finder one thing and the feature
> > generators another?
> >
> > Thank you in advance for the clarification.
> >
> > 2016-06-29 1:23 GMT+02:00 William Colen :
> >
> > > Not exactly. You would create a new NER model to replace yours.
> > >
> > > In this approach you would need a corpus like this:
> > >
> > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > > board as a nonexecutive director Nov. 29 .
> > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > > retiring , she was a board member for 5 years .
> > >
> > >
> > > I am not a native English speaker, so I am not sure if the example is
> > > clear enough. I tried to use Jessie as a gender-neutral name and "she"
> > > as the disambiguation.
> > >
> > > With a corpus big enough maybe you could create a model that outputs
> both
> > > classes, personMale and personFemale. To train a model you can follow
> > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > >
> > > Let's say your results are not good enough. You could increment your
> > model
> > > using Custom Feature Generators (
> > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > and
> > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > ).
> > >
> > > One of the implemented featuregen can take a dictionary (
> > >
> > >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > ).
> > > You can also implement other convenient FeatureGenerators, for
> > > instance one based on regular expressions.
> > >
> > > Again, it is just a wild guess at how to implement it. I don't know if
> > > it would perform well. I was only thinking about how to implement a
> > > gender ML model that uses the surrounding context.
> > >
> > > Hope I could clarify.
> > >
> > > William
> > >
> > > 2016-06-28 19:15 GMT-03:00 Damiano Porta :
> > >
> > > > Hi William,
> > > > OK, so you are talking about a kind of pipeline where we execute:
> > > >
> > > > 1. NER (personM for example)
> > > > 2. Regex (filter to reduce false positives)
> > > > 3. Plain dictionary (filter as above) ?
> > > >
> > > > Yes, we can split our model in two, for M and F; it is not a big
> > > > problem, as we have a database grouped by gender.
> > > >
> > > > I only have one doubt regarding the use of a dictionary: if we use a
> > > > dictionary to create the model, couldn't we just use it to detect
> > > > names directly, without using NER?
> > > >
> > > >
> > > >
> > > > 2016-06-29 0:10 GMT+02:00 William Colen :
> > > >
> > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > > > could try to split NER into two categories: PersonM and PersonF.
> > > > > Just an idea; I have never read or tried anything 

Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
There's also Doc2vec:

http://deeplearning4j.org/doc2vec.html

which could work as well.

On Wed, Jun 29, 2016 at 8:43 PM, William Colen 
wrote:

> Thank you, Boris. I am new to deep learning, so I have no idea what issues
> we would face. I was wondering if we could use Features2Vec instead of
> Word2Vec; does that make any sense?
> The idea was to use DL in low-level NLP tasks where we don't have parse
> trees yet.
>
>
> 2016-06-29 6:34 GMT-03:00 Boris Galitsky :
>
> > Hi guys
> >
> >   I should mention how we used DeepLearning4J for the OpenNLP.Similarity
> > project at
> >
> > https://github.com/bgalitsky/relevance-based-on-parse-trees
> >
> >
> > The main question is how word2vec models and linguistic information such
> > as parse trees complement each other. In a word2vec approach any two
> > words can be compared. The weakness here is that computing a distance
> > between totally unrelated words like 'cat' and 'fly' can be meaningless
> > and uninformative, and can corrupt a learning model.
> >
> >
> > In the OpenNLP.Similarity component, similarity is defined in terms of
> > parse trees. When word2vec is applied on top of parse trees and not as a
> > bag-of-words, we only compute the distance between words with the same
> > semantic role, so the model becomes more accurate.
> >
> >
> > There's a paper on the way which assesses the relevance improvement of
> >
> >
> > word2vec (bag-of-words) [traditional] vs word2vec (parse-trees)
> >
> >
> > Regards
> >
> > Boris
> >
> >
> >
> >
> >
> > 
> > From: Anthony Beylerian 
> > Sent: Wednesday, June 29, 2016 2:13:38 AM
> > To: dev@opennlp.apache.org
> > Subject: Re: DeepLearning4J as a ML for OpenNLP
> >
> > +1 would be willing to help out when possible
> >
>


Re: Model to detect the gender

2016-06-29 Thread William Colen
To create a NER model OpenNLP extracts features from the context, things
such as: word prefix and suffix, next word, previous word, previous word
prefix and suffix, next word prefix and suffix etc.
When you don't configure the feature generator it will apply the default:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api

Default feature generator:

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[]{
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });


These default features should work for most cases (especially English), but
they can of course be extended. If you do so, your model will take the new
features into account. So yes, you are putting the features in your model.
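
For illustration, extending the default set with a dictionary-based feature
and a custom regex-based one could look roughly like this ("maleNames" is a
placeholder opennlp.tools.dictionary.Dictionary, the suffix pattern is only a
toy heuristic, and you should double-check the exact train(...) signature of
your OpenNLP version when plugging the generator in):

AdaptiveFeatureGenerator extendedGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[]{
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false),
        new DictionaryFeatureGenerator("maleName", maleNames),  // added
        new RegexGenderFeatureGenerator()                       // added, below
    });

class RegexGenderFeatureGenerator extends FeatureGeneratorAdapter {

    private static final java.util.regex.Pattern FEMALE_SUFFIX =
            java.util.regex.Pattern.compile(".*[ae]$");  // toy heuristic

    @Override
    public void createFeatures(java.util.List<String> features, String[] tokens,
            int index, String[] previousOutcomes) {
        // emitted feature strings are arbitrary symbols for the learner
        if (FEMALE_SUFFIX.matcher(tokens[index].toLowerCase()).matches()) {
            features.add("suffix=femaleHint");
        }
    }
}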

Configuring custom features is not easy. I would start with the default, use
10-fold cross-validation, and take notes on its effectiveness. Then change or
add a feature, evaluate, and take notes again. Sometimes a feature that we
are sure would help can destroy the model's effectiveness.
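
For reference, the cross-validation itself can also be run from the command
line, roughly like this (flag names as I recall them from the 1.6 manual; the
training file is a placeholder):

$ opennlp TokenNameFinderCrossValidator -lang en -folds 10 \
    -data en-ner-person.train -encoding UTF-8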

Regards
William


2016-06-29 7:00 GMT-03:00 Damiano Porta :

> Thank you William! Really appreciated!
>
> There is just one point I do not get: when you said "You could increment
> your model using Custom Feature Generators", does it mean that I can "put"
> these features inside ONE *.bin* file (model) that implements different
> things, or is the name finder one thing and the feature generators another?
>
> Thank you in advance for the clarification.
>
> 2016-06-29 1:23 GMT+02:00 William Colen :
>
> > Not exactly. You would create a new NER model to replace yours.
> >
> > In this approach you would need a corpus like this:
> >
> > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > board as a nonexecutive director Nov. 29 .
> > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > retiring , she was a board member for 5 years .
> >
> >
> > I am not a native English speaker, so I am not sure if the example is
> > clear enough. I tried to use Jessie as a gender-neutral name and "she" as
> > the disambiguation.
> >
> > With a corpus big enough maybe you could create a model that outputs both
> > classes, personMale and personFemale. To train a model you can follow
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> >
> > Let's say your results are not good enough. You could increment your
> model
> > using Custom Feature Generators (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > and
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > ).
> >
> > One of the implemented featuregen can take a dictionary (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > ).
> > You can also implement other convenient FeatureGenerators, for instance
> > one based on regular expressions.
> >
> > Again, it is just a wild guess at how to implement it. I don't know if
> > it would perform well. I was only thinking about how to implement a
> > gender ML model that uses the surrounding context.
> >
> > Hope I could clarify.
> >
> > William
> >
> > 2016-06-28 19:15 GMT-03:00 Damiano Porta :
> >
> > > Hi William,
> > > OK, so you are talking about a kind of pipeline where we execute:
> > >
> > > 1. NER (personM for example)
> > > 2. Regex (filter to reduce false positives)
> > > 3. Plain dictionary (filter as above) ?
> > >
> > > Yes, we can split our model in two, for M and F; it is not a big
> > > problem, as we have a database grouped by gender.
> > >
> > > I only have one doubt regarding the use of a dictionary: if we use a
> > > dictionary to create the model, couldn't we just use it to detect names
> > > directly, without using NER?
> > >
> > >
> > >
> > > 2016-06-29 0:10 GMT+02:00 William Colen :
> > >
> > > > Do you plan to use the surrounding context? If yes, maybe you could
> > > > try to split NER into two categories: PersonM and PersonF. Just an
> > > > idea; I have never read or tried anything like it. You would need a
> > > > training corpus with these classes.
> > > >
> > > > You could add both the plain dictionary and the regex as NER features
> > as
> > > > well and check how it improves.
> > > >
> > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta :
> > > >
> > > > > Hello everybody,
> > > > >
> > > > > we built a NER model to 

Re: Model to detect the gender

2016-06-29 Thread Damiano Porta
Thank you William! Really appreciated!

There is just one point I do not get: when you said "You could increment
your model using Custom Feature Generators", does it mean that I can "put"
these features inside ONE *.bin* file (model) that implements different
things, or is the name finder one thing and the feature generators another?

Thank you in advance for the clarification.

2016-06-29 1:23 GMT+02:00 William Colen :

> Not exactly. You would create a new NER model to replace yours.
>
> In this approach you would need a corpus like this:
>
> <START:personMale> Pierre Vinken <END> , 61 years old , will join the board
> as a nonexecutive director Nov. 29 .
> Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> retiring , she was a board member for 5 years .
>
>
> I am not a native English speaker, so I am not sure if the example is
> clear enough. I tried to use Jessie as a gender-neutral name and "she" as
> the disambiguation.
>
> With a corpus big enough maybe you could create a model that outputs both
> classes, personMale and personFemale. To train a model you can follow
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
>
> Let's say your results are not good enough. You could increment your model
> using Custom Feature Generators (
>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> and
>
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> ).
>
> One of the implemented featuregen can take a dictionary (
>
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> ).
> You can also implement other convenient FeatureGenerators, for instance
> one based on regular expressions.
>
> Again, it is just a wild guess at how to implement it. I don't know if it
> would perform well. I was only thinking about how to implement a gender ML
> model that uses the surrounding context.
>
> Hope I could clarify.
>
> William
>
> 2016-06-28 19:15 GMT-03:00 Damiano Porta :
>
> > Hi William,
> > OK, so you are talking about a kind of pipeline where we execute:
> >
> > 1. NER (personM for example)
> > 2. Regex (filter to reduce false positives)
> > 3. Plain dictionary (filter as above) ?
> >
> > Yes, we can split our model in two, for M and F; it is not a big
> > problem, as we have a database grouped by gender.
> >
> > I only have one doubt regarding the use of a dictionary: if we use a
> > dictionary to create the model, couldn't we just use it to detect names
> > directly, without using NER?
> >
> >
> >
> > 2016-06-29 0:10 GMT+02:00 William Colen :
> >
> > > Do you plan to use the surrounding context? If yes, maybe you could
> > > try to split NER into two categories: PersonM and PersonF. Just an
> > > idea; I have never read or tried anything like it. You would need a
> > > training corpus with these classes.
> > >
> > > You could add both the plain dictionary and the regex as NER features
> as
> > > well and check how it improves.
> > >
> > > 2016-06-28 18:56 GMT-03:00 Damiano Porta :
> > >
> > > > Hello everybody,
> > > >
> > > > we built a NER model to find person names inside our documents.
> > > > We are looking for the best approach to determine whether a name is
> > > > male or female.
> > > >
> > > > Possible solutions:
> > > > - Plain dictionary?
> > > > - Regex to check the initial and/or the letters of the name?
> > > > - Classifier? (naive bayes? Maxent?)
> > > >
> > > > Thanks
> > > >
> > >
> >
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Boris Galitsky
Hi guys

  I should mention how we used DeepLearning4J for the OpenNLP.Similarity 
project at

https://github.com/bgalitsky/relevance-based-on-parse-trees


The main question is how word2vec models and linguistic information such as
parse trees complement each other. In a word2vec approach any two words can
be compared. The weakness here is that computing a distance between totally
unrelated words like 'cat' and 'fly' can be meaningless and uninformative,
and can corrupt a learning model.


In the OpenNLP.Similarity component, similarity is defined in terms of parse
trees. When word2vec is applied on top of parse trees and not as a
bag-of-words, we only compute the distance between words with the same
semantic role, so the model becomes more accurate.
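
A rough sketch of that restriction (the role maps, from each word to its
semantic role, are hypothetical helpers; WordVectors is DL4J's embedding
lookup interface):

import java.util.Map;

import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

class RoleAwareSimilarity {
    // Average word2vec similarity over word pairs that share a semantic role.
    static double similarity(Map<String, String> rolesA,
                             Map<String, String> rolesB, WordVectors vec) {
        double sum = 0;
        int pairs = 0;
        for (Map.Entry<String, String> a : rolesA.entrySet()) {
            for (Map.Entry<String, String> b : rolesB.entrySet()) {
                if (a.getValue().equals(b.getValue())) {  // same role only
                    sum += vec.similarity(a.getKey(), b.getKey());
                    pairs++;
                }
            }
        }
        // unrelated pairs like 'cat'/'fly' in different roles are never compared
        return pairs == 0 ? 0.0 : sum / pairs;
    }
}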


There's a paper on the way which assesses the relevance improvement of


word2vec (bag-of-words) [traditional] vs word2vec (parse-trees)


Regards

Boris






From: Anthony Beylerian 
Sent: Wednesday, June 29, 2016 2:13:38 AM
To: dev@opennlp.apache.org
Subject: Re: DeepLearning4J as a ML for OpenNLP

+1 would be willing to help out when possible


Re: Performances of OpenNLP tools

2016-06-29 Thread Anthony Beylerian
How about we keep track of the data sets used for performance evaluation and
the results in this doc for now:

https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing

I will try to take a better look at OntoNotes and what to use from it.
Otherwise, if anyone would like to suggest proper data sets for testing each
component, that would be really helpful.
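
For example, the name finder evaluator can be run from the command line like
this (invocation as in the 1.6 manual; the model and test files are
placeholders), and the other components have analogous evaluator tools:

$ opennlp TokenNameFinderEvaluator -model en-ner-person.bin \
    -data en-ner-person.test -encoding UTF-8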

Anthony

On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann  wrote:

> It would be nice to get MASC support into the OpenNLP formats package.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge  >
> wrote:
>
> > Jörn is absolutely right about that. Another good source of training data
> > is MASC. I've got some instructions for training models with MASC here:
> >
> > https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
> >
> > Chalk (now defunct) provided a Scala wrapper around OpenNLP
> functionality,
> > so the instructions there should make it fairly straightforward to adapt
> > MASC data to OpenNLP.
> >
> > -Jason
> >
> > On Tue, 21 Jun 2016 at 10:46 Joern Kottmann  wrote:
> >
> > > There are some research papers which study and compare the performance
> > > of NLP toolkits, but be careful: often they don't train the NLP tools
> > > on the same data, and the training data makes a big difference to the
> > > performance.
> > >
> > > Jörn
> > >
> > > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann 
> > > wrote:
> > >
> > > > Just don't use the very old existing models; to get good results you
> > > > have to train on your own data, especially if the domain of the data
> > > > used for training and the data which should be processed don't match.
> > > > The old models are trained on 90s news; those don't work well on
> > > > today's news, and probably much worse on tweets.
> > > >
> > > > OntoNotes is a good place to start if the goal is to process news.
> > > > OpenNLP comes with built-in support to train models from OntoNotes.
> > > >
> > > > Jörn
> > > >
> > > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > > >
> > > >> This sounds like a fantastic idea.
> > > >>
> > > >> ++
> > > >> Chris Mattmann, Ph.D.
> > > >> Chief Architect
> > > >> Instrument Software and Science Data Systems Section (398)
> > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > >> Office: 168-519, Mailstop: 168-527
> > > >> Email: chris.a.mattm...@nasa.gov
> > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > >> ++
> > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > >> Adjunct Associate Professor, Computer Science Department
> > > >> University of Southern California, Los Angeles, CA 90089 USA
> > > >> WWW: http://irds.usc.edu/
> > > >> ++
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <
> > anthonybeyler...@hotmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> >+1
> > > >> >
> > > >> >Maybe we could put the results of the evaluator tests for each
> > > >> >component somewhere on a webpage and update them on every release.
> > > >> >This is of course provided there are reasonable data sets for
> > > >> >testing each component.
> > > >> >What do you think?
> > > >> >
> > > >> >Anthony
> > > >> >
> > > >> >> From: mondher.bouaz...@gmail.com
> > > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > >> >> Subject: Re: Performances of OpenNLP tools
> > > >> >> To: dev@opennlp.apache.org
> > > >> >>
> > > >> >> Hi,
> > > >> >>
> > > >> >> Thank you for your replies.
> > > >> >>
> > > >> >> Jeffrey, please accept my apologies once more for the duplicate
> > > >> >> email.
> > > >> >>
> > > >> >> I also think it would be great to have such studies on the
> > > >> >> performance of OpenNLP.
> > > >> >>
> > > >> >> I have been looking for this information and have checked in
> > > >> >> many places, including obviously Google Scholar, and I haven't
> > > >> >> found any serious studies or reliable results. Most of the
> > > >> >> existing ones report the performance of outdated releases of
> > > >> >> OpenNLP, and focus more on execution time or CPU/RAM consumption,
> > > >> >> etc.
> > > >> >>
> > > >> >> I think such a comparison will help not only to evaluate the
> > > >> >> overall accuracy, but also to highlight the issues with the
> > > >> >> existing models (as a matter of fact, the existing models fail to
> > > >> >> recognize many of the hashtags in tweets: the tokenizer splits
> > > >> >> them into the "#" symbol and a word that the PoS tagger also
> > > >> >> fails to recognize).
> > 

Re: DeepLearning4J as a ML for OpenNLP

2016-06-29 Thread Anthony Beylerian
+1 would be willing to help out when possible