Re: Sentiment Analysis Parser updates

2016-06-22 Thread Mattmann, Chris A (3980)
Thank you, Jason!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++


Re: Sentiment Analysis Parser updates

2016-06-22 Thread Jason Baldridge
Anastasija,

There might be a few appropriate sentiment datasets listed in my homework
on Twitter sentiment analysis:

https://github.com/utcompling/applied-nlp/wiki/Homework5

There may also be some useful datasets in the Crowdflower Open Data
collection:

https://www.crowdflower.com/data-for-everyone/

Hope this helps!

-Jason


Re: Sentiment Analysis Parser updates

2016-06-22 Thread Mattmann, Chris A (3980)
Great work, Anastasija!


Sentiment Analysis Parser updates

2016-06-22 Thread Anastasija Mensikova
Hi everyone,

Some updates on our Sentiment Analysis Parser work.

As you might have noticed, I have recently enhanced our website (the GH page),
polished it, and made it more user-friendly. My next step will be sending a
pull request to Tika. However, my main goal for the remainder of Google Summer
of Code is to enhance the parser so that it works categorically: rather than
labelling sentiment as just positive or negative, it will distinguish several
categories. This means that my next step is to look for a categorical open
dataset (which I hope to do by the end of the weekend at the latest) and, of
course, to improve my model and training. After that I will look into how the
confidence levels can be increased.
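
For a concrete picture of the categorical direction, here is a minimal sketch
(not the parser's actual code) of training a multi-category sentiment model
with OpenNLP's document categorizer, assuming the 1.6-style API. The file
names are hypothetical, and the training file is assumed to hold one
"category<TAB>text" sample per line:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class CategoricalSentimentSketch {

    public static void main(String[] args) throws Exception {
        // One sample per line: "<category>\t<text>",
        // e.g. "very-negative\tWorst film I have ever seen."
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("sentiment-categories.train")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train a document categorizer; the set of categories is taken
        // directly from the labels seen in the training data.
        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());
        try (OutputStream out = new FileOutputStream("sentiment-categories.bin")) {
            model.serialize(out);
        }

        // Classify a new document into one of the trained categories.
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                "What a pleasant surprise, loved it!");
        double[] outcomes = categorizer.categorize(tokens);
        System.out.println(categorizer.getBestCategory(outcomes));
    }
}

Because the categories come straight from the data, moving from two labels to,
say, five is a data change rather than a code change.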

Have a great day/night!

Thank you,
Anastasija Mensikova.


Re: Performances of OpenNLP tools

2016-06-22 Thread Joern Kottmann
It would be nice to get MASC support into the OpenNLP formats package.

Jörn

On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge  wrote:

> Jörn is absolutely right about that. Another good source of training data
> is MASC. I've got some instructions for training models with MASC here:
>
> https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
>
> Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality,
> so the instructions there should make it fairly straightforward to adapt
> MASC data to OpenNLP.
>
> -Jason
>
> On Tue, 21 Jun 2016 at 10:46 Joern Kottmann  wrote:
>
> > There are some research papers which study and compare the performance of
> > NLP toolkits, but be careful: often they don't train the NLP tools on the
> > same data, and the training data makes a big difference to the performance.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann  wrote:
> >
> > > Just don't use the very old existing models; to get good results you
> > > have to train on your own data, especially if the domain of the data
> > > used for training and the data which should be processed doesn't match.
> > > The old models are trained on 90s news; those don't work well on today's
> > > news, and probably much worse on tweets.
> > >
> > > OntoNotes is a good place to start if the goal is to process news.
> > > OpenNLP comes with built-in support to train models from OntoNotes.
> > >
> > > Jörn
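
As an illustration of Jörn's advice about training on your own data, a hedged
sketch using OpenNLP's name finder and its <START:...> <END> training format;
the file names and the "person" entity type are hypothetical:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainOwnNameFinder {

    public static void main(String[] args) throws Exception {
        // One tokenized sentence per line, entities marked inline, e.g.:
        // <START:person> Pierre Vinken <END> , 61 years old , will join ...
        ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(new File("own-domain.train")),
                        StandardCharsets.UTF_8));

        // Train on in-domain data instead of relying on the old 90s-news models.
        TokenNameFinderModel model = NameFinderME.train(
                "en", "person", samples,
                TrainingParameters.defaultParams(), new TokenNameFinderFactory());

        try (OutputStream out = new FileOutputStream("own-person-model.bin")) {
            model.serialize(out);
        }
    }
}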
> > >
> > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > chris.a.mattm...@jpl.nasa.gov> wrote:
> > >
> > >> This sounds like a fantastic idea.
> > >>
> > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <anthonybeyler...@hotmail.com>
> > >> wrote:
> > >>
> > >> >+1
> > >> >
> > >> >Maybe we could put the results of the evaluator tests for each component
> > >> >somewhere on a webpage and update them on every release. This is, of
> > >> >course, provided there are reasonable datasets for testing each
> > >> >component.
> > >> >What do you think?
> > >> >
> > >> >Anthony
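
One hedged sketch of what such a per-component check could look like, here for
the POS tagger: load a model, run OpenNLP's evaluator over held-out data in
the word_tag format, and print the number a results page would publish. The
model and test-file paths are hypothetical:

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSEvaluator;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class PosAccuracyReport {

    public static void main(String[] args) throws Exception {
        // A trained tagger model (path is hypothetical).
        POSModel model = new POSModel(new File("en-pos-maxent.bin"));

        // Held-out sentences in the word_tag format, one sentence per line.
        ObjectStream<POSSample> testData = new WordTagSampleStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(new File("pos-test.txt")),
                        StandardCharsets.UTF_8));

        // Tag the held-out sentences and compare against the gold tags.
        POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
        evaluator.evaluate(testData);

        System.out.println("Word accuracy: " + evaluator.getWordAccuracy());
    }
}

OpenNLP ships analogous evaluators for the other components (for example
TokenizerEvaluator and TokenNameFinderEvaluator), so the same pattern could
feed a per-release results page.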
> > >> >
> > >> >> From: mondher.bouaz...@gmail.com
> > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > >> >> Subject: Re: Performances of OpenNLP tools
> > >> >> To: dev@opennlp.apache.org
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> Thank you for your replies.
> > >> >>
> > >> >> Jeffrey, please accept my apologies once more for the email reaching
> > >> >> you twice.
> > >> >>
> > >> >> I also think it would be great to have such studies of the
> > >> >> performance of OpenNLP.
> > >> >>
> > >> >> I have been looking for this information and have checked in many
> > >> >> places, obviously including Google Scholar, and I haven't found any
> > >> >> serious studies or reliable results. Most of the existing ones report
> > >> >> the performance of outdated releases of OpenNLP, and focus more on
> > >> >> execution time, CPU/RAM consumption, etc.
> > >> >>
> > >> >> I think such a comparison would help not only to evaluate the overall
> > >> >> accuracy, but also to highlight the issues with the existing models
> > >> >> (as a matter of fact, the existing models fail to recognize many of
> > >> >> the hashtags in tweets: the tokenizer splits them into the "#" symbol
> > >> >> and a word that the PoS tagger then also fails to recognize).
> > >> >>
> > >> >> Therefore, building Twitter-based models would also be useful, since
> > >> >> much of the work in academia and industry is focusing on Twitter data.
> > >> >>
> > >> >> Best regards,
> > >> >>
> > >> >> Mondher
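
The hashtag behaviour Mondher describes is easy to reproduce. A minimal sketch
using the rule-based SimpleTokenizer; the assumption, for illustration, is
that the learned TokenizerME models trained on news text split hashtags the
same way:

import opennlp.tools.tokenize.SimpleTokenizer;

public class HashtagTokenization {

    public static void main(String[] args) {
        // The character-class based tokenizer splits at the symbol/letter
        // boundary, so "#OpenNLP" comes out as two tokens: "#" and "OpenNLP".
        String tweet = "Loving the new release #OpenNLP #nlp";
        for (String token : SimpleTokenizer.INSTANCE.tokenize(tweet)) {
            System.out.println(token);
        }
        // Prints: Loving / the / new / release / # / OpenNLP / # / nlp
    }
}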
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge
> > >> >> <jasonbaldri...@gmail.com> wrote:
> > >> >>
> > >> >> > It would be fantastic to have these numbers. This is an example of
> > >> >> > something that would be a great contribution by someone trying to
> > >> >> > contribute to open source and who is maybe just getting into machine
> > >> >> > learning and natural language processing.
> > >> >> >
> > >> >> > For Twitter-ish text, it'd be great to look at models trained and
> > >> >> > evaluated on the Tw