Re: Model to detect the gender
Not exactly. You would create a new NER model to replace yours. In this
approach you would need a corpus like this:

    Pierre Vinken , 61 years old , will join the board as a nonexecutive
    director Nov. 29 . Mr . Vinken is chairman of Elsevier N.V. , the
    Dutch publishing group . Jessie Robson is retiring , she was a board
    member for 5 years .

I am not a native English speaker, so I am not sure the example is clear
enough. I tried to use "Jessie" as a gender-neutral name and "she" as the
disambiguating context. With a big enough corpus, maybe you could create a
model that outputs both classes, personMale and personFemale. To train a
model you can follow
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training

Let's say your results are not good enough. You could then improve your
model using custom feature generators (
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
and
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
). One of the implemented feature generators can take a dictionary (
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
). You could also implement other convenient feature generators, for
instance one based on regular expressions.

Again, this is just a guess at how to implement it; I don't know how well it
would perform. I was only thinking about how to implement a gender ML model
that uses the surrounding context.

Hope I could clarify.

William

2016-06-28 19:15 GMT-03:00 Damiano Porta:

> Hi William,
> Ok, so you are talking about a kind of pipeline where we execute:
>
> 1. NER (personM for example)
> 2. Regex (filter to reduce false positives)
> 3. Plain dictionary (filter as above) ?
>
> Yes, we can split our model in two for M and F, it is not a big problem; we
> have a database grouped by gender.
>
> I only have a doubt regarding the use of a dictionary. Because if we use a
> dictionary to create the model, we could only use it to detect names
> without using NER. No?
>
> 2016-06-29 0:10 GMT+02:00 William Colen:
>
> > Do you plan to use the surrounding context? If yes, maybe you could try
> > to split NER in two categories: PersonM and PersonF. Just an idea, never
> > read or tried anything like it. You would need a training corpus with
> > these classes.
> >
> > You could add both the plain dictionary and the regex as NER features as
> > well and check how it improves.
> >
> > 2016-06-28 18:56 GMT-03:00 Damiano Porta:
> >
> > > Hello everybody,
> > >
> > > we built a NER model to find persons (names) inside our documents.
> > > We are looking for the best approach to understand if the name is
> > > male/female.
> > >
> > > Possible solutions:
> > > - Plain dictionary?
> > > - Regex to check the initial and/or final letters of the name?
> > > - Classifier? (Naive Bayes? Maxent?)
> > >
> > > Thanks
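To make the corpus idea above concrete: annotated with the standard OpenNLP name finder `<START:type> ... <END>` markup, the training data would look roughly like this (the personMale/personFemale type names are just the ones suggested in the message, not anything shipped with OpenNLP):

```
<START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
```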
Re: Model to detect the gender
Hi William,
Ok, so you are talking about a kind of pipeline where we execute:

1. NER (personM for example)
2. Regex (filter to reduce false positives)
3. Plain dictionary (filter as above) ?

Yes, we can split our model in two for M and F, it is not a big problem; we
have a database grouped by gender.

I only have a doubt regarding the use of a dictionary. Because if we use a
dictionary to create the model, we could only use it to detect names without
using NER. No?

2016-06-29 0:10 GMT+02:00 William Colen:

> Do you plan to use the surrounding context? If yes, maybe you could try to
> split NER in two categories: PersonM and PersonF. Just an idea, never read
> or tried anything like it. You would need a training corpus with these
> classes.
>
> You could add both the plain dictionary and the regex as NER features as
> well and check how it improves.
>
> 2016-06-28 18:56 GMT-03:00 Damiano Porta:
>
> > Hello everybody,
> >
> > we built a NER model to find persons (names) inside our documents.
> > We are looking for the best approach to understand if the name is
> > male/female.
> >
> > Possible solutions:
> > - Plain dictionary?
> > - Regex to check the initial and/or final letters of the name?
> > - Classifier? (Naive Bayes? Maxent?)
> >
> > Thanks
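A minimal sketch of the three-step pipeline described above, in plain Python. This only illustrates the filtering logic, not the OpenNLP API; the candidate list, regex, and dictionary contents are made-up examples:

```python
import re

# Hypothetical output of step 1, the NER step: candidate person names.
ner_candidates = ["Pierre", "Jessie", "Maria", "X123"]

# Step 2: regex filter to drop obvious false positives
# (here: anything that is not a capitalized alphabetic token).
NAME_RE = re.compile(r"^[A-Z][a-z]+$")

# Step 3: plain dictionary lookup, grouped by gender
# (standing in for the database grouped by gender mentioned above).
GENDER_DICT = {"Pierre": "M", "Maria": "F"}

def classify(candidates):
    """Return {name: gender} for candidates that survive both filters."""
    results = {}
    for name in candidates:
        if not NAME_RE.match(name):      # regex filter
            continue
        gender = GENDER_DICT.get(name)   # dictionary filter
        if gender is not None:
            results[name] = gender
    return results

print(classify(ner_candidates))  # {'Pierre': 'M', 'Maria': 'F'}
```

Note that "Jessie" passes the regex but is dropped by the dictionary, which is exactly the doubt raised above: a pure dictionary filter cannot decide names that are missing from it.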
Re: Model to detect the gender
Do you plan to use the surrounding context? If yes, maybe you could try to
split NER in two categories: PersonM and PersonF. Just an idea, never read or
tried anything like it. You would need a training corpus with these classes.

You could add both the plain dictionary and the regex as NER features as well
and check how it improves.

2016-06-28 18:56 GMT-03:00 Damiano Porta:

> Hello everybody,
>
> we built a NER model to find persons (names) inside our documents.
> We are looking for the best approach to understand if the name is
> male/female.
>
> Possible solutions:
> - Plain dictionary?
> - Regex to check the initial and/or final letters of the name?
> - Classifier? (Naive Bayes? Maxent?)
>
> Thanks
Re: DeepLearning4J as a ML for OpenNLP
I had briefly looked into it a while ago; it would be nice to collaborate
there.

Tommaso

On Tue, Jun 28, 2016 at 23:26 Mattmann, Chris A (3980)
<chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep, I think so - you may also look at SciSpark
> http://scispark.jpl.nasa.gov
> where we are using DL4J/ND4J and Breeze interchangeably.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 6/28/16, 2:23 PM, "William Colen" wrote:
>
> >Hi,
> >
> >Do you think it would be possible to implement an ML backend based on DL4J?
> >
> >http://deeplearning4j.org/
> >
> >Thank you
> >William
Re: DeepLearning4J as a ML for OpenNLP
Thank you for the pointer, Prof. Chris. Can you please point me to the exact
project at http://scispark.jpl.nasa.gov/ that I should look at? It is huge.

Thank you again.

William Colen

2016-06-28 18:26 GMT-03:00 Mattmann, Chris A (3980)
<chris.a.mattm...@jpl.nasa.gov>:

> Yep, I think so - you may also look at SciSpark
> http://scispark.jpl.nasa.gov
> where we are using DL4J/ND4J and Breeze interchangeably.
>
> Cheers,
> Chris
>
> On 6/28/16, 2:23 PM, "William Colen" wrote:
>
> >Hi,
> >
> >Do you think it would be possible to implement an ML backend based on DL4J?
> >
> >http://deeplearning4j.org/
> >
> >Thank you
> >William
Re: DeepLearning4J as a ML for OpenNLP
Suneel,

I mean an implementation so we can use DL4J to train the OpenNLP models, just
like we already do in the opennlp.tools.ml package with Maxent, Perceptron,
and NaiveBayes. I think it was Jörn who also did a few others that are in the
sandbox: Mallet and Mahout.

Thank you!
William

2016-06-28 18:27 GMT-03:00 Suneel Marthi:

> Are you looking at using ND4J (from the Deeplearning4j project) as the math
> backend for ML work? If so, yes.
>
> From: William Colen
> To: "dev@opennlp.apache.org"
> Sent: Tuesday, June 28, 2016 5:23 PM
> Subject: DeepLearning4J as a ML for OpenNLP
>
> Hi,
>
> Do you think it would be possible to implement an ML backend based on DL4J?
>
> http://deeplearning4j.org/
>
> Thank you
> William
Re: DeepLearning4J as a ML for OpenNLP
Are you looking at using ND4J (from the Deeplearning4j project) as the math
backend for ML work? If so, yes.

From: William Colen
To: "dev@opennlp.apache.org"
Sent: Tuesday, June 28, 2016 5:23 PM
Subject: DeepLearning4J as a ML for OpenNLP

Hi,

Do you think it would be possible to implement an ML backend based on DL4J?

http://deeplearning4j.org/

Thank you
William
Re: Sentiment Analysis Parser updates
Thanks William, this is a great idea. I will discuss it with Anastasija
tomorrow.

Cheers,
Chris

On 6/28/16, 12:01 PM, "William Colen" wrote:

>Hi,
>
>I tried your code. Very good work so far! Congratulations.
>
>Is the examples/result file corrupted? It has only one line.
>
>Do you plan to implement a simple CLI to use it interactively from the
>command line, similar to
>
>bin/opennlp Doccat
>bin/opennlp TokenNameFinder
>
>?
>
>Also, do you plan to add evaluation tools by extending
>AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
>listener EvaluationErrorPrinter? I found these tools very useful while
>developing new models and features; maybe you would find them useful as
>well.
>
>You could also check the DoccatFineGrainedReportListener as a starting
>point to create a confusion matrix (I think it would be easy because the
>Doccat data structures are similar to yours).
>
>The result would look like the following (this is a 300-entry Portuguese
>corpus I am building from Facebook messages):
>
>=== Evaluation summary ===
>  Number of documents:   298
>    Min sentence size:     1
>    Max sentence size:   463
>Average sentence size: 18,01
>     Categories count:     4
>             Accuracy: 61,41%
>
>=== Detailed Accuracy By Tag ===
>
>| Tag      | Errors | Count | % Err | Precision | Recall | F-Measure |
>| neutral  |     46 |    56 | 0,821 |     0,588 |  0,179 |     0,274 |
>| positive |     46 |    70 | 0,657 |      0,48 |  0,343 |       0,4 |
>| negative |     18 |   167 | 0,108 |     0,651 |  0,892 |     0,753 |
>| spam     |      5 |     5 |     1 |         0 |      0 |         0 |
>
>=== Confusion matrix ===
>
>     a    b    c    d | Accuracy | <-- classified as
> <149>   13    4    1 |   89,22% | a = negative
>    42  <24>   3    1 |   34,29% | b = positive
>    35   11  <10>   . |   17,86% | c = neutral
>     3    2    .  <.> |       0% | d = spam
>
>Regards,
>William
>
>2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980)
><chris.a.mattm...@jpl.nasa.gov>:
>
>> Thank you Jason!
>>
>> On 6/22/16, 8:41 PM, "Jason Baldridge" wrote:
>>
>> >Anastasija,
>> >
>> >There might be a few appropriate sentiment datasets listed in my
>> >homework on Twitter sentiment analysis:
>> >
>> >https://github.com/utcompling/applied-nlp/wiki/Homework5
>> >
>> >There may also be some useful data sets in the Crowdflower Open Data
>> >collection:
>> >
>> >https://www.crowdflower.com/data-for-everyone/
>> >
>> >Hope this helps!
>> >
>> >-Jason
>> >
>> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova
>> ><mensikova.anastas...@gmail.com> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> Some updates on our Sentiment Analysis Parser work.
>> >>
>> >> You might have noticed that I have enhanced our website (the GitHub
>> >> page) recently, polished it, and made it more user-friendly. My next
>> >> step will be sending a pull request to Tika. However, my main goal
>> >> until the end of Google Summer of Code is to enhance the parser in a
>> >> way that will allow it to work categorically (in other words, the
>> >> sentiment determined won't be just positive or negative; it will have
>> >> a few categories). This means that my next step is to look for a
>> >> categorical open data set (which I will hopefully do by
Re: DeepLearning4J as a ML for OpenNLP
Yep, I think so - you may also look at SciSpark
http://scispark.jpl.nasa.gov
where we are using DL4J/ND4J and Breeze interchangeably.

Cheers,
Chris

On 6/28/16, 2:23 PM, "William Colen" wrote:

>Hi,
>
>Do you think it would be possible to implement an ML backend based on DL4J?
>
>http://deeplearning4j.org/
>
>Thank you
>William
DeepLearning4J as a ML for OpenNLP
Hi,

Do you think it would be possible to implement an ML backend based on DL4J?

http://deeplearning4j.org/

Thank you
William
Re: Sentiment Analysis Parser updates
Hi,

I tried your code. Very good work so far! Congratulations.

Is the examples/result file corrupted? It has only one line.

Do you plan to implement a simple CLI to use it interactively from the
command line, similar to

bin/opennlp Doccat
bin/opennlp TokenNameFinder

?

Also, do you plan to add evaluation tools by extending AbstractEvaluatorTool
and AbstractCrossValidatorTool, as well as the listener
EvaluationErrorPrinter? I found these tools very useful while developing new
models and features; maybe you would find them useful as well.

You could also check the DoccatFineGrainedReportListener as a starting point
to create a confusion matrix (I think it would be easy because the Doccat
data structures are similar to yours).

The result would look like the following (this is a 300-entry Portuguese
corpus I am building from Facebook messages):

=== Evaluation summary ===
  Number of documents:   298
    Min sentence size:     1
    Max sentence size:   463
Average sentence size: 18,01
     Categories count:     4
             Accuracy: 61,41%

=== Detailed Accuracy By Tag ===

| Tag      | Errors | Count | % Err | Precision | Recall | F-Measure |
| neutral  |     46 |    56 | 0,821 |     0,588 |  0,179 |     0,274 |
| positive |     46 |    70 | 0,657 |      0,48 |  0,343 |       0,4 |
| negative |     18 |   167 | 0,108 |     0,651 |  0,892 |     0,753 |
| spam     |      5 |     5 |     1 |         0 |      0 |         0 |

=== Confusion matrix ===

     a    b    c    d | Accuracy | <-- classified as
 <149>   13    4    1 |   89,22% | a = negative
    42  <24>   3    1 |   34,29% | b = positive
    35   11  <10>   . |   17,86% | c = neutral
     3    2    .  <.> |       0% | d = spam

Regards,
William

2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980)
<chris.a.mattm...@jpl.nasa.gov>:

> Thank you Jason!
>
> On 6/22/16, 8:41 PM, "Jason Baldridge" wrote:
>
> >Anastasija,
> >
> >There might be a few appropriate sentiment datasets listed in my homework
> >on Twitter sentiment analysis:
> >
> >https://github.com/utcompling/applied-nlp/wiki/Homework5
> >
> >There may also be some useful data sets in the Crowdflower Open Data
> >collection:
> >
> >https://www.crowdflower.com/data-for-everyone/
> >
> >Hope this helps!
> >
> >-Jason
> >
> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova
> ><mensikova.anastas...@gmail.com> wrote:
> >
> >> Hi everyone,
> >>
> >> Some updates on our Sentiment Analysis Parser work.
> >>
> >> You might have noticed that I have enhanced our website (the GitHub
> >> page) recently, polished it, and made it more user-friendly. My next
> >> step will be sending a pull request to Tika. However, my main goal
> >> until the end of Google Summer of Code is to enhance the parser in a
> >> way that will allow it to work categorically (in other words, the
> >> sentiment determined won't be just positive or negative; it will have a
> >> few categories). This means that my next step is to look for a
> >> categorical open data set (which I will hopefully do by the end of the
> >> weekend at the latest) and, of course, enhance my model and training.
> >> After that I will look into how the confidence levels can be increased.
> >>
> >> Have a great day/night!
> >>
> >> Thank you,
> >> Anastasija Mensikova.
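As a quick sanity check on the report above, the summary figures can be recomputed from the confusion matrix. This is plain Python with the cell values copied by hand (the `.` cells are zeros); it is not an OpenNLP tool:

```python
# Rows of the confusion matrix above, in order: negative, positive, neutral, spam.
# Columns are the classified-as labels in the same order.
matrix = [
    [149, 13, 4, 1],  # a = negative (row sum 167, matching the tag count)
    [42, 24, 3, 1],   # b = positive (row sum 70)
    [35, 11, 10, 0],  # c = neutral  (row sum 56)
    [3, 2, 0, 0],     # d = spam     (row sum 5)
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = round(100.0 * correct / total, 2)

print(total)     # 298, matching "Number of documents"
print(accuracy)  # 61.41, matching the "Accuracy" line
```

The diagonal (149 + 24 + 10 + 0 = 183) over 298 documents reproduces the reported 61,41% accuracy.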
Re: Usages of Adaptive features.
You can also activate the monitor from the command line, using the
-misclassified and -detailedF options:

bin/opennlp TokenNameFinderCrossValidator

Usage: opennlp TokenNameFinderCrossValidator[.ontonotes|.bionlp2004|.conll03|.conll02|.ad|.evalita|.muc6|.brat]
        [-factory factoryName] [-resources resourcesDir] [-type modelType]
        [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec]
        [-params paramsFile] -lang language [-misclassified true|false]
        [-folds num] [-detailedF true|false] -data sampleData
        [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of TokenNameFinderFactory
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -nameTypes types
                name types to use for training
        -sequenceCodec codec
                sequence codec used to code name spans
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -misclassified true|false
                if true will print false negatives and false positives.
        -folds num
                number of folds, default is 10.
        -detailedF true|false
                if true will print detailed FMeasure results.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system
                default is used.

William Colen

2016-06-28 11:04 GMT-03:00 William Colen:

> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
>
> Do you have a specific question?
>
> You can try the default feature generator and check how your model will
> perform in terms of precision and recall. You can take a look at the kinds
> of errors (use an EvaluationMonitor:
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/eval/EvaluationMonitor.html)
> and try to figure out which features it is missing that would give it a
> hint on how to perform better.
> Add the features and check precision and recall again.
>
> 2016-06-21 13:45 GMT-03:00 :
>
>> Please share the usages of adaptive features that are used in NER
>> tagging.
>>
>> Regards,
>> Rakesh.P
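Putting the options together, a cross-validation run that prints both the misclassified samples and the detailed F-measure report could look like this (the training file name, language code, and encoding are placeholder values, not files shipped with OpenNLP):

```shell
bin/opennlp TokenNameFinderCrossValidator -lang en \
    -misclassified true -detailedF true -folds 10 \
    -data en-ner-person.train -encoding UTF-8
```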