Re: Model to detect the gender

2016-06-28 Thread William Colen
Not exactly. You would create a new NER model to replace yours.

In this approach you would need a corpus like this:

<START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .


I am not a native English speaker, so I am not sure the example is
clear enough. I tried to use Jessie as a gender-neutral name and "she" as
the disambiguating context.

With a big enough corpus, you may be able to train a model that outputs
both classes, personMale and personFemale. To train a model, you can follow
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
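
The name finder training format described in the manual linked above puts one
tokenized sentence per line and wraps each name in <START:type> ... <END>.
Below is a minimal Python sketch of how such lines could be generated from
tokens plus gender-labelled spans. The helper name `annotate` and the
(start, end, type) span tuples are hypothetical, chosen only for illustration;
this is a preprocessing idea, not part of OpenNLP.

```python
def annotate(tokens, spans):
    """Render one tokenized sentence in name finder training format.

    tokens: list of token strings.
    spans:  list of (start, end, name_type) tuples, end exclusive,
            assumed not to overlap (hypothetical input layout).
    """
    starts = {start: name_type for start, _end, name_type in spans}
    ends = {end for _start, end, _name_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")          # close a span that stops before token i
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")              # span that runs to the end of the sentence
    return " ".join(out)


tokens = "Jessie Robson is retiring , she was a board member for 5 years .".split()
print(annotate(tokens, [(0, 2, "personFemale")]))
```

Each returned string is one line of the training file; the gender label would
come from whatever annotation source is available.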

Let's say your results are not good enough. You could improve your model
using custom feature generators (
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
and
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
).

One of the implemented feature generators can take a dictionary (
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
).
You can also implement other convenient feature generators, for instance
one based on regex.
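
To make the idea concrete, here is a small Python sketch of the kind of
features a dictionary-based and a regex-based generator might emit for one
token. It does not use the OpenNLP API; the name lists, the feature strings,
and the "ends in a" heuristic are all assumptions made up for illustration.

```python
import re

# Hypothetical gender-grouped name lists; a real system would load them
# from its own database of names.
MALE_NAMES = {"pierre", "william", "damiano"}
FEMALE_NAMES = {"anastasija", "maria"}

# Assumed heuristic, for illustration only: capitalized token ending in "a".
ENDS_IN_A = re.compile(r"^[A-Z][a-z]*a$")


def gender_features(tokens, index):
    """Return feature strings for tokens[index]; they are hints, not decisions."""
    token = tokens[index]
    lower = token.lower()
    features = []
    if lower in MALE_NAMES:
        features.append("dict=male")
    if lower in FEMALE_NAMES:
        features.append("dict=female")
    if ENDS_IN_A.match(token):
        features.append("regex=endsInA")
    return features


print(gender_features(["Maria", "Rossi"], 0))
```

Used this way, the dictionary and the regex only contribute evidence. The
trained model would still weigh them against the surrounding context, which
is what matters for names missing from the dictionary or ambiguous ones.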

Again, this is just a wild guess at how to implement it. I don't know if it
would perform well. I was only thinking about how to implement a gender ML
model that uses the surrounding context.

I hope this clarifies things.

William

2016-06-28 19:15 GMT-03:00 Damiano Porta :

> Hi William,
> Ok, so you are talking about a kind of pipe where we execute:
>
> 1. NER (personM for example)
> 2. Regex (filter to reduce false positives)
> 3. Plain dictionary (filter as above) ?
>
> Yes we can split out model in two for M and F, it is not a big problem, we
> have a database grouped by gender.
>
> I only have a doubt regarding the use of a dictionary. Because if we use a
> dictionary to create the model, we could only use it to detect names
> without using NER. No?
>
>
>
> 2016-06-29 0:10 GMT+02:00 William Colen :
>
> > Do you plan to use the surrounding context? If yes, maybe you could try
> to
> > split NER in two categories: PersonM and PersonF. Just an idea, never
> read
> > or tried anything like it. You would need a training corpus with these
> > classes.
> >
> > You could add both the plain dictionary and the regex as NER features as
> > well and check how it improves.
> >
> > 2016-06-28 18:56 GMT-03:00 Damiano Porta :
> >
> > > Hello everybody,
> > >
> > > we built a NER model to find persons (name) inside our documents.
> > > We are looking for the best approach to understand if the name is
> > > male/female.
> > >
> > > Possible solutions:
> > > - Plain dictionary?
> > > - Regex to check the initial and/letters of the name?
> > > - Classifier? (naive bayes? Maxent?)
> > >
> > > Thanks
> > >
> >
>


Re: Model to detect the gender

2016-06-28 Thread Damiano Porta
Hi William,
OK, so you are talking about a kind of pipeline where we execute:

1. NER (personM for example)
2. Regex (filter to reduce false positives)
3. Plain dictionary (filter as above) ?

Yes, we can split our model in two for M and F. It is not a big problem; we
have a database grouped by gender.

My only doubt is about the dictionary: if we use a dictionary to create
the model, couldn't we just use the dictionary to detect names, without
NER?



2016-06-29 0:10 GMT+02:00 William Colen :

> Do you plan to use the surrounding context? If yes, maybe you could try to
> split NER in two categories: PersonM and PersonF. Just an idea, never read
> or tried anything like it. You would need a training corpus with these
> classes.
>
> You could add both the plain dictionary and the regex as NER features as
> well and check how it improves.
>
> 2016-06-28 18:56 GMT-03:00 Damiano Porta :
>
> > Hello everybody,
> >
> > we built a NER model to find persons (name) inside our documents.
> > We are looking for the best approach to understand if the name is
> > male/female.
> >
> > Possible solutions:
> > - Plain dictionary?
> > - Regex to check the initial and/letters of the name?
> > - Classifier? (naive bayes? Maxent?)
> >
> > Thanks
> >
>


Re: Model to detect the gender

2016-06-28 Thread William Colen
Do you plan to use the surrounding context? If yes, maybe you could try to
split NER into two categories: PersonM and PersonF. It is just an idea; I
have never read about or tried anything like it. You would need a training
corpus with these classes.

You could also add both the plain dictionary and the regex as NER features
and check whether they improve the results.

2016-06-28 18:56 GMT-03:00 Damiano Porta :

> Hello everybody,
>
> we built a NER model to find persons (name) inside our documents.
> We are looking for the best approach to understand if the name is
> male/female.
>
> Possible solutions:
> - Plain dictionary?
> - Regex to check the initial and/letters of the name?
> - Classifier? (naive bayes? Maxent?)
>
> Thanks
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread Tommaso Teofili
I briefly looked into it a while ago; it would be nice to collaborate
there.

Tommaso


On Tue, 28 Jun 2016 at 23:26, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep I think so - you may also look at SciSpark
> http://scispark.jpl.nasa.gov
> where we are using DL4J/ND4J and Breeze interchangeably here.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
> On 6/28/16, 2:23 PM, "William Colen"  wrote:
>
> >Hi,
> >
> >Do you think it would be possible to implement a ML based on DL4J?
> >
> >http://deeplearning4j.org/
> >
> >Thank you
> >William
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Thank you for the pointer, Prof. Chris. Can you please point me to the exact
project at http://scispark.jpl.nasa.gov/ that I should look at? It is huge.

Thank you again.
William

William Colen

2016-06-28 18:26 GMT-03:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> Yep I think so - you may also look at SciSpark
> http://scispark.jpl.nasa.gov
> where we are using DL4J/ND4J and Breeze interchangeably here.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
> On 6/28/16, 2:23 PM, "William Colen"  wrote:
>
> >Hi,
> >
> >Do you think it would be possible to implement a ML based on DL4J?
> >
> >http://deeplearning4j.org/
> >
> >Thank you
> >William
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Suneel,

I mean an implementation that lets us use DL4J to train the OpenNLP models,
just as we already do in the opennlp.tools.ml package with Maxent,
Perceptron, and NaiveBayes. I think it was Jörn who also did a few others
that are in the sandbox: Mallet and Mahout.

Thank you!
William

2016-06-28 18:27 GMT-03:00 Suneel Marthi :

> Are u looking at using ND4J (from Deeplearning4j project) as the Math
> backend for ML work? If so, yes.
>
>
>   From: William Colen 
>  To: "dev@opennlp.apache.org" 
>  Sent: Tuesday, June 28, 2016 5:23 PM
>  Subject: DeepLearning4J as a ML for OpenNLP
>
> Hi,
>
> Do you think it would be possible to implement a ML based on DL4J?
>
> http://deeplearning4j.org/
>
> Thank you
> William
>
>
>
>


Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread Suneel Marthi
Are you looking at using ND4J (from the Deeplearning4j project) as the math
backend for ML work? If so, yes.


  From: William Colen 
 To: "dev@opennlp.apache.org"  
 Sent: Tuesday, June 28, 2016 5:23 PM
 Subject: DeepLearning4J as a ML for OpenNLP
   
Hi,

Do you think it would be possible to implement a ML based on DL4J?

http://deeplearning4j.org/

Thank you
William


   

Re: Sentiment Analysis Parser updates

2016-06-28 Thread Mattmann, Chris A (3980)
Thanks William, this is a great idea. I will discuss it with 
Anastasija tomorrow.


Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/28/16, 12:01 PM, "William Colen"  wrote:

>Hi,
>
>I tried your code. Very good work so far! Congratulations.
>
>Is the examples/result file corrupted? It has only one line.
>
>Do you plan to implement a simple CLI to use it interactively from command
>line, similar to
>
>bin/opennlp Doccat
>bin/opennlp TokenNameFinder
>
>?
>
>Also, do you plan to add evaluation tools by extending
>AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
>listener EvaluationErrorPrinter? I found these tools very useful while I am
>developing new models and features, maybe you would find it useful as well.
>
>You could also check the DoccatFineGrainedReportListener as a start point
>to create a confusion matrix (I think it would be easy because Doccat data
>structures are similar to yours).
>
>The result would look like the follow (this is a 300 entries Portuguese
>corpus I am building from Facebook messages):
>
>
>=== Evaluation summary ===
>  Number of documents:    298
>    Min sentence size:      1
>    Max sentence size:    463
>Average sentence size:  18,01
>     Categories count:      4
>             Accuracy: 61,41%
>
>=== Detailed Accuracy By Tag ===
>
>-------------------------------------------------------------------------
>|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
>-------------------------------------------------------------------------
>|  neutral |     46 |     56 | 0,821   | 0,588     | 0,179  | 0,274     |
>| positive |     46 |     70 | 0,657   | 0,48      | 0,343  | 0,4       |
>| negative |     18 |    167 | 0,108   | 0,651     | 0,892  | 0,753     |
>|     spam |      5 |      5 | 1       | 0         | 0      | 0         |
>-------------------------------------------------------------------------
>
>=== Confusion matrix ===
>
>      a      b      c      d | Accuracy | <-- classified as
>  <149>     13      4      1 |   89,22% |   a = negative
>     42   <24>      3      1 |   34,29% |   b = positive
>     35     11   <10>      . |   17,86% |   c = neutral
>      3      2      .    <.> |       0% |   d = spam
>
>
>
>
>Regards,
>William
>
>2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov>:
>
>> Thank you Jason!
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>>
>> On 6/22/16, 8:41 PM, "Jason Baldridge"  wrote:
>>
>> >Anastasija,
>> >
>> >There might be a few appropriate sentiment datasets listed in my homework
>> >on Twitter sentiment analysis:
>> >
>> >https://github.com/utcompling/applied-nlp/wiki/Homework5
>> >
>> >There may also be some useful data sets in the Crowdflower Open Data
>> >collection:
>> >
>> >https://www.crowdflower.com/data-for-everyone/
>> >
>> >Hope this helps!
>> >
>> >-Jason
>> >
>> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
>> >mensikova.anastas...@gmail.com> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> Some updates on our Sentiment Analysis Parser work.
>> >>
>> >> You might have noticed, I have enhanced our website (the GH page)
>> recently,
>> >> polished it and made it more user-friendly. My next step will be
>> sending a
>> >> pull request to Tika. However, my main goal until the end of Google
>> Summer
>> >> of Code is to enhance the parser in a way that will allow it to work
>> >> categorically (in other words, the sentiment determined won't be just
>> >> positive or negative, it will have a few categories). This means that my
>> >> next step is to look for a categorical open data set (which I will
>> >> hopefully do by 

Re: DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread Mattmann, Chris A (3980)
Yep, I think so - you may also look at SciSpark http://scispark.jpl.nasa.gov
where we are using DL4J/ND4J and Breeze interchangeably.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/28/16, 2:23 PM, "William Colen"  wrote:

>Hi,
>
>Do you think it would be possible to implement a ML based on DL4J?
>
>http://deeplearning4j.org/
>
>Thank you
>William


DeepLearning4J as a ML for OpenNLP

2016-06-28 Thread William Colen
Hi,

Do you think it would be possible to implement a ML based on DL4J?

http://deeplearning4j.org/

Thank you
William


Re: Sentiment Analysis Parser updates

2016-06-28 Thread William Colen
Hi,

I tried your code. Very good work so far! Congratulations.

Is the examples/result file corrupted? It has only one line.

Do you plan to implement a simple CLI to use it interactively from the
command line, similar to

bin/opennlp Doccat
bin/opennlp TokenNameFinder

?

Also, do you plan to add evaluation tools by extending
AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
listener EvaluationErrorPrinter? I find these tools very useful while
developing new models and features; maybe you would find them useful as well.

You could also check the DoccatFineGrainedReportListener as a starting point
to create a confusion matrix (I think it would be easy because Doccat data
structures are similar to yours).

The result would look like the following (this is a 300-entry Portuguese
corpus I am building from Facebook messages):


=== Evaluation summary ===
  Number of documents:    298
    Min sentence size:      1
    Max sentence size:    463
Average sentence size:  18,01
     Categories count:      4
             Accuracy: 61,41%

=== Detailed Accuracy By Tag ===

-------------------------------------------------------------------------
|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
-------------------------------------------------------------------------
|  neutral |     46 |     56 | 0,821   | 0,588     | 0,179  | 0,274     |
| positive |     46 |     70 | 0,657   | 0,48      | 0,343  | 0,4       |
| negative |     18 |    167 | 0,108   | 0,651     | 0,892  | 0,753     |
|     spam |      5 |      5 | 1       | 0         | 0      | 0         |
-------------------------------------------------------------------------

=== Confusion matrix ===

      a      b      c      d | Accuracy | <-- classified as
  <149>     13      4      1 |   89,22% |   a = negative
     42   <24>      3      1 |   34,29% |   b = positive
     35     11   <10>      . |   17,86% |   c = neutral
      3      2      .    <.> |       0% |   d = spam
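
The per-tag figures in this report can be reproduced from the confusion
matrix alone. The Python sketch below assumes that rows are the annotated
category, columns are the predicted one, and the dots stand for zero; those
assumptions are consistent with the numbers reported above. The function
name `per_tag_metrics` is hypothetical and this is not the OpenNLP listener
code.

```python
labels = ["negative", "positive", "neutral", "spam"]

# Rows: annotated category; columns: predicted category (assumed layout).
matrix = [
    [149, 13, 4, 1],
    [42, 24, 3, 1],
    [35, 11, 10, 0],
    [3, 2, 0, 0],
]


def per_tag_metrics(matrix, labels):
    """Return {label: (precision, recall, f_measure)} from a square matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        annotated = sum(matrix[i])                  # row total
        predicted = sum(row[i] for row in matrix)   # column total
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / annotated if annotated else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in matrix)

print("accuracy: %.2f%%" % (accuracy * 100))
for label, (p, r, f) in per_tag_metrics(matrix, labels).items():
    print("%-8s precision=%.3f recall=%.3f f=%.3f" % (label, p, r, f))
```

The "Accuracy" column of the confusion matrix is the same quantity as the
per-tag recall, expressed as a percentage.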




Regards,
William

2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> Thank you Jason!
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
> On 6/22/16, 8:41 PM, "Jason Baldridge"  wrote:
>
> >Anastasija,
> >
> >There might be a few appropriate sentiment datasets listed in my homework
> >on Twitter sentiment analysis:
> >
> >https://github.com/utcompling/applied-nlp/wiki/Homework5
> >
> >There may also be some useful data sets in the Crowdflower Open Data
> >collection:
> >
> >https://www.crowdflower.com/data-for-everyone/
> >
> >Hope this helps!
> >
> >-Jason
> >
> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
> >mensikova.anastas...@gmail.com> wrote:
> >
> >> Hi everyone,
> >>
> >> Some updates on our Sentiment Analysis Parser work.
> >>
> >> You might have noticed, I have enhanced our website (the GH page)
> recently,
> >> polished it and made it more user-friendly. My next step will be
> sending a
> >> pull request to Tika. However, my main goal until the end of Google
> Summer
> >> of Code is to enhance the parser in a way that will allow it to work
> >> categorically (in other words, the sentiment determined won't be just
> >> positive or negative, it will have a few categories). This means that my
> >> next step is to look for a categorical open data set (which I will
> >> hopefully do by the end of the weekend the latest) and, of course,
> enhance
> >> my model and training. After that I will look into how the confidence
> >> levels can be increased.
> >>
> >> Have a great day/night!
> >>
> >> Thank you,
> >> Anastasija Mensikova.
> >>
>


Re: Usages of Adaptive features.

2016-06-28 Thread William Colen
You can also activate the monitor from the command line, using the
misclassified and detailedF options:

bin/opennlp TokenNameFinderCrossValidator
Usage: opennlp TokenNameFinderCrossValidator[.ontonotes|.bionlp2004|.conll03|.conll02|.ad|.evalita|.muc6|.brat]
        [-factory factoryName] [-resources resourcesDir] [-type modelType]
        [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec]
        [-params paramsFile] -lang language [-misclassified true|false]
        [-folds num] [-detailedF true|false] -data sampleData
        [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of TokenNameFinderFactory
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -nameTypes types
                name types to use for training
        -sequenceCodec codec
                sequence codec used to code name spans
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -misclassified true|false
                if true will print false negatives and false positives.
        -folds num
                number of folds, default is 10.
        -detailedF true|false
                if true will print detailed FMeasure results.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
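
As an illustration, a concrete invocation can be assembled from the flags
listed in the usage text. The Python sketch below only builds and prints the
command; the helper name, the file name gender-ner.train, and the language
code are hypothetical placeholders.

```python
import shlex


def cross_validator_command(data_file, lang, folds=10, encoding="UTF-8"):
    """Build the argument list using only flags shown in the usage text."""
    return [
        "bin/opennlp", "TokenNameFinderCrossValidator",
        "-lang", lang,
        "-misclassified", "true",   # print false negatives and false positives
        "-detailedF", "true",       # print detailed F-measure results
        "-folds", str(folds),
        "-data", data_file,
        "-encoding", encoding,
    ]


command = cross_validator_command("gender-ner.train", "en")
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run from the OpenNLP
installation directory, or the printed line pasted into a shell.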

William Colen

2016-06-28 11:04 GMT-03:00 William Colen :

>
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
>
> Do you have a specific question?
>
> You can try the default feature generator and check how your model
> performs in terms of precision and recall. You can look at the kinds
> of errors it makes (use an EvaluationMonitor
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/eval/EvaluationMonitor.html)
> and try to figure out which missing features would give it a hint
> to perform better.
> Then add the features and check precision and recall again.
>
> 2016-06-21 13:45 GMT-03:00 :
>
>> Please share the usages of Adaptive features that are used in NER tagging?
>>
>> Regards,
>> Rakesh.P
>>
>
>