Hi,
I tried your code. Very good work so far! Congratulations.
Is the examples/result file corrupted? It has only one line.
Do you plan to implement a simple CLI so it can be used interactively from the
command line, similar to
bin/opennlp Doccat
bin/opennlp TokenNameFinder
?
Also, do you plan to add evaluation tools by extending
AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
listener EvaluationErrorPrinter? I find these tools very useful while
developing new models and features, so you might find them useful as well.
You could also look at DoccatFineGrainedReportListener as a starting point
for creating a confusion matrix (I think it would be easy because the Doccat
data structures are similar to yours).
The result would look like the following (this is from a 300-entry Portuguese
corpus I am building from Facebook messages):
=== Evaluation summary ===
Number of documents: 298
Min sentence size: 1
Max sentence size: 463
Average sentence size: 18,01
Categories count: 4
Accuracy: 61,41%
=== Detailed Accuracy By Tag ===
----------------------------------------------------------------------
| Tag      | Errors | Count | % Err | Precision | Recall | F-Measure |
----------------------------------------------------------------------
| neutral  |     46 |    56 | 0,821 |     0,588 |  0,179 |     0,274 |
| positive |     46 |    70 | 0,657 |      0,48 |  0,343 |       0,4 |
| negative |     18 |   167 | 0,108 |     0,651 |  0,892 |     0,753 |
| spam     |      5 |     5 |     1 |         0 |      0 |         0 |
----------------------------------------------------------------------
=== Confusion matrix ===
   a     b     c     d  | Accuracy | <-- classified as
<149>   13     4     1  |   89,22% | a = negative
  42   <24>    3     1  |   34,29% | b = positive
  35    11   <10>    .  |   17,86% | c = neutral
   3     2     .    <.> |       0% | d = spam
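In case it helps to see how the per-tag numbers relate to the confusion matrix, here is a minimal sketch in plain Java. It is not the OpenNLP implementation: the class name ConfusionMatrixMetrics and its methods are hypothetical, and it assumes (as the report suggests) that each row is a reference category and each column is the predicted category, with "." read as 0.

```java
import java.util.Locale;

public class ConfusionMatrixMetrics {

    // Assumption: rows are reference categories, columns are predictions,
    // both in the order negative, positive, neutral, spam.
    static final String[] TAGS = {"negative", "positive", "neutral", "spam"};
    static final int[][] MATRIX = {
        {149, 13,  4, 1},
        { 42, 24,  3, 1},
        { 35, 11, 10, 0},
        {  3,  2,  0, 0}
    };

    // Correct predictions for a tag divided by everything predicted as that tag.
    static double precision(int[][] m, int tag) {
        int predicted = 0;
        for (int[] row : m) {
            predicted += row[tag];
        }
        return predicted == 0 ? 0.0 : (double) m[tag][tag] / predicted;
    }

    // Correct predictions for a tag divided by all reference documents with that tag.
    static double recall(int[][] m, int tag) {
        int actual = 0;
        for (int count : m[tag]) {
            actual += count;
        }
        return actual == 0 ? 0.0 : (double) m[tag][tag] / actual;
    }

    // Harmonic mean of precision and recall, defined as 0 when both are 0.
    static double fMeasure(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Sum of the diagonal divided by the total number of documents.
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        for (int t = 0; t < TAGS.length; t++) {
            double p = precision(MATRIX, t);
            double r = recall(MATRIX, t);
            System.out.printf(Locale.ROOT, "%-8s P=%.3f R=%.3f F=%.3f%n",
                    TAGS[t], p, r, fMeasure(p, r));
        }
        System.out.printf(Locale.ROOT, "accuracy=%.2f%%%n", 100 * accuracy(MATRIX));
    }
}
```

With the matrix above, this gives 149/229 precision and 149/167 recall for "negative", and 183/298 overall accuracy, which agrees with the report.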
Regards,
William
2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <
[email protected]>:
> Thank you Jason!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 6/22/16, 8:41 PM, "Jason Baldridge" <[email protected]> wrote:
>
> >Anastasija,
> >
> >There might be a few appropriate sentiment datasets listed in my homework
> >on Twitter sentiment analysis:
> >
> >https://github.com/utcompling/applied-nlp/wiki/Homework5
> >
> >There may also be some useful data sets in the Crowdflower Open Data
> >collection:
> >
> >https://www.crowdflower.com/data-for-everyone/
> >
> >Hope this helps!
> >
> >-Jason
> >
> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
> >[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> Some updates on our Sentiment Analysis Parser work.
> >>
> >> You might have noticed, I have enhanced our website (the GH page) recently,
> >> polished it and made it more user-friendly. My next step will be sending a
> >> pull request to Tika. However, my main goal until the end of Google Summer
> >> of Code is to enhance the parser in a way that will allow it to work
> >> categorically (in other words, the sentiment determined won't be just
> >> positive or negative, it will have a few categories). This means that my
> >> next step is to look for a categorical open data set (which I will
> >> hopefully do by the end of the weekend the latest) and, of course, enhance
> >> my model and training. After that I will look into how the confidence
> >> levels can be increased.
> >>
> >> Have a great day/night!
> >>
> >> Thank you,
> >> Anastasija Mensikova.
> >>
>