Thanks William, this is a great idea. I will discuss it with Anastasija tomorrow.
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 6/28/16, 12:01 PM, "William Colen" <[email protected]> wrote:

>Hi,
>
>I tried your code. Very good work so far! Congratulations.
>
>Is the examples/result file corrupted? It has only one line.
>
>Do you plan to implement a simple CLI to use it interactively from command
>line, similar to
>
>bin/opennlp Doccat
>bin/opennlp TokenNameFinder
>
>?
>
>Also, do you plan to add evaluation tools by extending
>AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
>listener EvaluationErrorPrinter? I found these tools very useful while I am
>developing new models and features, maybe you would find it useful as well.
>
>You could also check the DoccatFineGrainedReportListener as a start point
>to create a confusion matrix (I think it would be easy because Doccat data
>structures are similar to yours).
>
>The result would look like the follow (this is a 300 entries Portuguese
>corpus I am building from Facebook messages):
>
>
>=== Evaluation summary ===
>  Number of documents:    298
>    Min sentence size:      1
>    Max sentence size:    463
>Average sentence size:  18,01
>     Categories count:      4
>             Accuracy: 61,41%
>
>=== Detailed Accuracy By Tag ===
>
>-------------------------------------------------------------------------
>|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
>-------------------------------------------------------------------------
>|  neutral |     46 |     56 |   0,821 |     0,588 |  0,179 |     0,274 |
>| positive |     46 |     70 |   0,657 |      0,48 |  0,343 |       0,4 |
>| negative |     18 |    167 |   0,108 |     0,651 |  0,892 |     0,753 |
>|     spam |      5 |      5 |       1 |         0 |      0 |         0 |
>-------------------------------------------------------------------------
>
>=== Confusion matrix ===
>
>
>    a     b     c     d | Accuracy | <-- classified as
> <149>   13     4     1 |   89,22% |   a = negative
>   42   <24>    3     1 |   34,29% |   b = positive
>   35    11   <10>    . |   17,86% |   c = neutral
>    3     2     .    <.>|       0% |   d = spam
>
>
>Regards,
>William
>
>2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <[email protected]>:
>
>> Thank you Jason!
>>
>> On 6/22/16, 8:41 PM, "Jason Baldridge" <[email protected]> wrote:
>>
>> >Anastasija,
>> >
>> >There might be a few appropriate sentiment datasets listed in my homework
>> >on Twitter sentiment analysis:
>> >
>> >https://github.com/utcompling/applied-nlp/wiki/Homework5
>> >
>> >There may also be some useful data sets in the Crowdflower Open Data
>> >collection:
>> >
>> >https://www.crowdflower.com/data-for-everyone/
>> >
>> >Hope this helps!
>> >
>> >-Jason
>> >
>> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <[email protected]> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> Some updates on our Sentiment Analysis Parser work.
>> >>
>> >> You might have noticed, I have enhanced our website (the GH page)
>> >> recently, polished it and made it more user-friendly. My next step will
>> >> be sending a pull request to Tika. However, my main goal until the end
>> >> of Google Summer of Code is to enhance the parser in a way that will
>> >> allow it to work categorically (in other words, the sentiment determined
>> >> won't be just positive or negative, it will have a few categories). This
>> >> means that my next step is to look for a categorical open data set
>> >> (which I will hopefully do by the end of the weekend the latest) and, of
>> >> course, enhance my model and training. After that I will look into how
>> >> the confidence levels can be increased.
>> >>
>> >> Have a great day/night!
>> >>
>> >> Thank you,
>> >> Anastasija Mensikova.
