http://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.doccat
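
For orientation: the doccat trainer described at that link reads one document per line, with the category first, then whitespace, then the document text. Below is a minimal sketch of flattening labeled reviews into that layout. It is not OpenNLP code; the function names and the two sample reviews are hypothetical and invented for illustration.

```python
import re


def to_doccat_line(category: str, text: str) -> str:
    """Render one labeled document as a single 'category text' line."""
    if not category or re.search(r"\s", category):
        raise ValueError("category must be a non-empty token without whitespace")
    # Collapse newlines, tabs and repeated spaces so the document stays on one line.
    flat = " ".join(text.split())
    return f"{category} {flat}"


def to_doccat_lines(samples):
    """samples: iterable of (category, text) pairs; yields one training line each."""
    for category, text in samples:
        yield to_doccat_line(category, text)


# Hypothetical reviews, invented for illustration only.
reviews = [
    ("pos", "A warm, funny film.\nI would watch it again."),
    ("neg", "Slow   and\tpredictable."),
]
lines = list(to_doccat_lines(reviews))
```

Writing `lines` to a UTF-8 text file, one entry per line, should give a file the doccat trainer can read; the same approach would work for pos/neg or multi-class labels.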

On Thu, Jun 29, 2017 at 1:51 PM, Chris Mattmann <[email protected]> wrote:

> Thanks, Joern, I will investigate the below. Can someone send me some
> pointers to the Doc Cat API? Thanks.
>
>
>
>
> On 6/29/17, 10:18 AM, "Joern Kottmann" <[email protected]> wrote:
>
>     For 2. I would like to suggest that we implement doccat format support
>     to train on that data.
>
>     3. It would be best to think about how we want to test the doccat
>     component; today we don't have any tests which use lots of data to
>     evaluate it.
>     Probably the sentiment data could solve this for us, and a train and
>     evaluate test could be included in the eval tests.
>
>     +1 to revert and then do these steps after the 1.8.1 release.
>
>     I can apply my PR myself if nobody objects.
>
>     Jörn
>
>     On Thu, Jun 29, 2017 at 7:10 PM, Chris Mattmann <[email protected]> wrote:
>     > Hi Rodrigo,
>     >
>     > This is very useful feedback that I wish we would have had a long time ago.
>     >
>     > I will look into it and see if I can reproduce the CLI error. I did a full build and mvn
>     > install (which I thought would run tests?) before committing and as I posted in JIRA
>     > the tests passed for me? So I will have to look into that.
>     >
>     > That said, given your feedback that SentimentME and the Sentiment Component
>     > don’t offer much over Document Classifier, I agree with you, but I wasn’t super
>     > familiar with the Document Classifier API. That said, if we can get the same functionality
>     > by just using Document Classifier why don’t we:
>     >
>     > 1. Remove the SentimentME and associated code (except for the unit tests)
>     > 2. Use the sample datasets from Netflix & Stanford Treebank sentiment and
>     > build models using Document Classifier API.
>     > 3. Rename and keep the unit tests that test against Netflix and Stanford Treebank.
>     >
>     > That way we get basic sentiment analysis (that is working for us internally at JPL decently)
>     > for Apache OpenNLP, and then if we want to build something better than a Document
>     > Classification approach to sentiment we can do so.
>     >
>     > Thoughts?
>     >
>     > Thanks for your useful feedback. If everyone agrees this is a plan I can back out the code
>     > using Joern’s revert, and then try and execute 1-3 above in a branch first. Thanks.
>     >
>     > Cheers,
>     > Chris
>     >
>     >
>     >
>     > On 6/29/17, 10:03 AM, "Rodrigo Agerri" <[email protected]> wrote:
>     >
>     >     Hi Chris,
>     >
>     >     I have been interested in the new sentiment component for a while,
>     >     although truth be told, I did not follow it that closely. Today I
>     >     looked at it and tested it with some of the corpora you have mentioned.
>     >     In order to do that, I checked out master to work from this commit
>     >     onwards:
>     >
>     >     https://github.com/apache/opennlp/commit/56321aab51a470cd2004b76fb1f5330881b943c1
>     >
>     >     1. I tried to run it from the CLI. The Sentiment component did not
>     >     appear to be available.
>     >     2. I added the SentimentTrainer and Evaluator to the cmdline.CLI (no
>     >     SentimentTool is implemented to tag with a trained model).
>     >     3. After that, the CLI tests did not pass. So, the CLI is currently
>     >     non-functional, unless I did something wrong, which is always possible, of
>     >     course. See if you can reproduce that error.
>     >
>     >     I therefore did the tests via the API. I implemented a little test for
>     >     training, evaluating and tagging here:
>     >
>     >     https://github.com/ixa-ehu/ixa-pipe-doc/tree/test
>     >
>     >     I ran the training on the Large Movie Review dataset from Stanford for binary
>     >     polarity classification
>     >
>     >     http://ai.stanford.edu/~amaas/data/sentiment/
>     >
>     >     and on the two small sample multiclass files added in resources and
>     >     mentioned in the previous email, using the first one for training and
>     >     the second one for testing (maxent, 100 iterations, cutoff 5).
>     >
>     >     1. Stanford results: 0.84264
>     >     2. sample multiclass: 0.73
>     >
>     >     Given that this is a standard document classification task, I decided
>     >     to train the doccat component from the CLI:
>     >
>     >     1. Stanford results: 0.84264 (BOW features by default).
>     >     2. sample multiclass: 0.73
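
To make the comparison concrete, here is a minimal sketch of what "bag of words" (BOW) features and an accuracy score mean. It assumes the figures quoted above are accuracies, which the message does not state explicitly, and it uses plain lowercase whitespace tokenization, which may differ from what OpenNLP's feature generator actually does. All names are hypothetical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count tokens in a document, discarding word order."""
    # Assumption: lowercase + whitespace tokenization, for illustration only.
    return Counter(text.lower().split())


def accuracy(gold, predicted) -> float:
    """Fraction of documents whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy data, invented for illustration only.
features = bag_of_words("Great great movie")
score = accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])
```

Two classifiers that see the same BOW features and use the same maxent settings would be expected to score identically, which is consistent with the matching numbers reported for the two components.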
>     >
>     >     I then looked at the code of the sentiment component and saw that it
>     >     is basically a document classifier working with bag-of-words features.
>     >     No added functionality. So, my conclusions are:
>     >
>     >     1. The CLI needs to be fixed.
>     >     2. The Sentiment component, as it is, provides the same functionality
>     >     as the document classifier.
>     >
>     >     I would therefore reconsider this commit until those two issues are
>     >     addressed. Just my opinion.
>     >
>     >     Best regards,
>     >
>     >     Rodrigo
>     >
>     >     On Thu, Jun 29, 2017 at 5:30 PM, Chris Mattmann <[email protected]> wrote:
>     >     >
>     >     > Hey Joern,
>     >     >
>     >     > Sure, you can find the model data links here, along with our evaluation of them.
>     >     >
>     >     > http://irds.usc.edu/SentimentAnalysisParser/datasets.html
>     >     >
>     >     > There are other evaluations here:
>     >     >
>     >     > http://irds.usc.edu/SentimentAnalysisParser/models.html
>     >     >
>     >     > The HT provider review I cannot contribute at this time and I question its broad
>     >     > applicability since it’s related to human trafficking. In addition we are still working
>     >     > on publishing our analysis & evaluation of it which is why I removed it from the
>     >     > commit.
>     >     >
>     >     > Cheers,
>     >     > Chris
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     > On 6/29/17, 7:36 AM, "Joern Kottmann" <[email protected]> wrote:
>     >     >
>     >     >     Which data sets did you use to evaluate this?
>     >     >     I was looking for a bit more than a sample file to train it.
>     >     >
>     >     >     I noticed that you checked in Stanford and Netflix models.
>     >     >
>     >     >     The stanford data set is probably this one:
>     >     >     http://ai.stanford.edu/~amaas/data/sentiment/
>     >     >
>     >     >     Do you have a link for the netflix data?
>     >     >
>     >     >     Jörn
>     >     >
>     >     >     On Thu, Jun 29, 2017 at 4:00 PM, Chris Mattmann <[email protected]> wrote:
>     >     >     > Absolutely, you can find them here:
>     >     >     >
>     >     >     > opennlp-tools/src/test/resources/opennlp/tools/sentiment/sample_train_categ (for categorical/multi-class)
>     >     >     > opennlp-tools/src/test/resources/opennlp/tools/sentiment/sample_train_categ2 (for categorical/multi-class)
>     >     >     >
>     >     >     > We can also do similar files where instead of multi-class, we just use pos/neg as the label.
>     >     >     > Cheers,
>     >     >     > Chris
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     > On 6/29/17, 2:35 AM, "Joern Kottmann" <[email protected]> wrote:
>     >     >     >
>     >     >     >     Hello Chris,
>     >     >     >
>     >     >     >     could you please point me to files I can use to train the sentiment
>     >     >     >     component? I am currently looking again through the code and would
>     >     >     >     like to train it myself.
>     >     >     >
>     >     >     >     Jörn
>     >     >     >
>     >     >     >     On Tue, Jun 27, 2017 at 4:59 PM, Dan Russ <[email protected]> wrote:
>     >     >     >     > Hi All,
>     >     >     >     >   First, let me take a share of blame for the comment Chris mentioned. I believe I said
>     >     >     >     > something like the pull request was X revisions behind and Y revisions ahead. It was not
>     >     >     >     > meant to be rude; it was meant to say it is hard to review code when it is so different
>     >     >     >     > from the current code base. I am very excited that sentiment analysis is going to be
>     >     >     >     > added to OpenNLP, but I have not had time to play with it. If I were to say “great job”
>     >     >     >     > before I have had a chance to look at it, it would be flattery, not honest praise.
>     >     >     >     >
>     >     >     >     >   Let’s clean up the merge. I agree with Chris that scalability and perfection should not
>     >     >     >     > be our initial goal. Let’s get something, and we can decide how to optimize later (even
>     >     >     >     > if it requires a complete rewrite). Perfection is the enemy of the good.
>     >     >     >     >
>     >     >     >     >   Finally, because of Chris’ comments it is hard to thank Ana and Chris without sounding
>     >     >     >     > insincere. But I’ll try: thank you Chris and Ana. I hope we can get beyond this and that
>     >     >     >     > Chris and Ana will continue to improve the performance of the sentiment analysis tool and
>     >     >     >     > happily remain part of the OpenNLP family. It is also a good time to toss a big thank you
>     >     >     >     > to all of the committers, users, and PMC members. I use OpenNLP almost every day. Your
>     >     >     >     > work is extremely valuable to me.
>     >     >     >     >
>     >     >     >     > Thank you,
>     >     >     >     > Daniel
>     >     >     >     >
>     >     >     >     >> On Jun 27, 2017, at 10:25 AM, Chris Mattmann <[email protected]> wrote:
>     >     >     >     >>
>     >     >     >     >> Hi everyone,
>     >     >     >     >>
>     >     >     >     >> I spoke with Joern in Slack. Some of his concerns are:
>     >     >     >     >>
>     >     >     >     >> 1. This was done with a Merge commit and apparently they squash and rebase.
>     >     >     >     >> [would be helpful to see some pointer on this for documentation, thus far I
>     >     >     >     >> haven’t found any]
>     >     >     >     >> 2. Apparently we literally need to ask others for +1 votes and record them
>     >     >     >     >> before committing? I thought since Ana and I are committers and were +1,
>     >     >     >     >> and since Joern had been providing feedback (the last of which was to add
>     >     >     >     >> tests, which we did) that he would be +1 as well (I guess he is not, and I guess
>     >     >     >     >> formally we need to do a +1 vote even still)
>     >     >     >     >> 3. There was concern about scalability of the code.
>     >     >     >     >> 4. There are thoughts that the code was not perfect yet (even though it works
>     >     >     >     >> fine in the MEMEX project for Ana and me)
>     >     >     >     >>
>     >     >     >     >> So, Joern has opened up a revert PR.
>     >     >     >     >>
>     >     >     >     >> I suppose I should state I find this process extremely heavyweight and unwelcoming.
>     >     >     >     >> To me, there should be a modicum of trust for committers, but I feel like even as a
>     >     >     >     >> committer, I am operating as a “contributor” to the project. Committer means that
>     >     >     >     >> there is trust to modify the source code base. Of the issues above, the only one I see
>     >     >     >     >> as a moderate snafu was #1, and frankly if there are some instructions that show me
>     >     >     >     >> how to do squashing and rebasing *first* I will try to do that in the future since I am
>     >     >     >     >> not a Git expert.
>     >     >     >     >>
>     >     >     >     >> That said, I must state I feel pretty put off by Apache OpenNLP. This originated as a GSoC
>     >     >     >     >> effort, and we have worked pretty consistently on this over the last year. We used a
>     >     >     >     >> separate GitHub project to get started, kept Joern involved as another mentor, even
>     >     >     >     >> provided access and commit rights to that GitHub repository for a long time, so this
>     >     >     >     >> code was developed in the open. Joern even created a branch in Apache OpenNLP for the
>     >     >     >     >> code and I suppose I should have gone and worked on that branch first since master is
>     >     >     >     >> apparently so pristine that even an Apache veteran like me can’t get something into it
>     >     >     >     >> without raising a whole bunch of (what are IMO minor) issues, and what are IMO
>     >     >     >     >> heavyweight “community” issues.
>     >     >     >     >>
>     >     >     >     >> I am concerned from a community point of view that the first comment wasn’t “Great
>     >     >     >     >> job Chris, you got Sentiment Analysis into Apache, *but* I have these concerns 1-4 above”.
>     >     >     >     >> It was “The PR was merged wrong in ways 1-4 and I’m going to revert it.”
>     >     >     >     >>
>     >     >     >     >> That’s pretty off-putting to someone who is semi-new like me and like Ana.
>     >     >     >     >>
>     >     >     >     >> Anyways, go ahead and revert it. Sorry to have caused any issues.
>     >     >     >     >>
>     >     >     >     >> Chris
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >> On 6/27/17, 7:06 AM, "Chris Mattmann" <[email protected]> wrote:
>     >     >     >     >>
>     >     >     >     >>    Hi Joern,
>     >     >     >     >>
>     >     >     >     >>    I’m confused. Why did you revert my commit?
>     >     >     >     >>
>     >     >     >     >>    Every one of those check points you put on the PR was checked?
>     >     >     >     >>    We have been discussing this for months, you have seen the
>     >     >     >     >>    code for months, Ana and I have worked diligently on the code
>     >     >     >     >>    in plain view of everyone.
>     >     >     >     >>
>     >     >     >     >>    Please explain.
>     >     >     >     >>
>     >     >     >     >>    Chris
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>    On 6/27/17, 1:23 AM, "kottmann" <[email protected]> wrote:
>     >     >     >     >>
>     >     >     >     >>        GitHub user kottmann opened a pull request:
>     >     >     >     >>
>     >     >     >     >>            https://github.com/apache/opennlp/pull/238
>     >     >     >     >>
>     >     >     >     >>            Revert merging of sentiment work, no consent to merge it
>     >     >     >     >>
>     >     >     >     >>            Thank you for contributing to Apache OpenNLP.
>     >     >     >     >>
>     >     >     >     >>            In order to streamline the review of the contribution we ask you
>     >     >     >     >>            to ensure the following steps have been taken:
>     >     >     >     >>
>     >     >     >     >>            ### For all changes:
>     >     >     >     >>            - [ ] Is there a JIRA ticket associated with this PR? Is it referenced
>     >     >     >     >>                 in the commit message?
>     >     >     >     >>
>     >     >     >     >>            - [ ] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
>     >     >     >     >>
>     >     >     >     >>            - [ ] Has your PR been rebased against the latest commit within the target branch (typically master)?
>     >     >     >     >>
>     >     >     >     >>            - [ ] Is your initial contribution a single, squashed commit?
>     >     >     >     >>
>     >     >     >     >>            ### For code changes:
>     >     >     >     >>            - [ ] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
>     >     >     >     >>            - [ ] Have you written or updated unit tests to verify your changes?
>     >     >     >     >>            - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
>     >     >     >     >>            - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
>     >     >     >     >>            - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
>     >     >     >     >>
>     >     >     >     >>            ### For documentation related changes:
>     >     >     >     >>            - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
>     >     >     >     >>
>     >     >     >     >>            ### Note:
>     >     >     >     >>            Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>        You can merge this pull request into a Git repository by running:
>     >     >     >     >>
>     >     >     >     >>            $ git pull https://github.com/kottmann/opennlp revert_sentiment
>     >     >     >     >>
>     >     >     >     >>        Alternatively you can review and apply these changes as the patch at:
>     >     >     >     >>
>     >     >     >     >>            https://github.com/apache/opennlp/pull/238.patch
>     >     >     >     >>
>     >     >     >     >>        To close this pull request, make a commit to your master/trunk branch
>     >     >     >     >>        with (at least) the following in the commit message:
>     >     >     >     >>
>     >     >     >     >>            This closes #238
>     >     >     >     >>
>     >     >     >     >>        ----
>     >     >     >     >>        commit 123222eb34724bae793e9d6d22e202c0aee0aa45
>     >     >     >     >>        Author: Jörn Kottmann <[email protected]>
>     >     >     >     >>        Date:   2017-06-27T08:19:19Z
>     >     >     >     >>
>     >     >     >     >>            Revert merging of sentiment work, no consent to merge it
>     >     >     >     >>
>     >     >     >     >>        ----
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>        ---
>     >     >     >     >>        If your project is set up for it, you can reply to this email and have your
>     >     >     >     >>        reply appear on GitHub as well. If your project does not have this feature
>     >     >     >     >>        enabled and wishes so, or if the feature is enabled but not working, please
>     >     >     >     >>        contact infrastructure at [email protected] or file a JIRA ticket
>     >     >     >     >>        with INFRA.
>     >     >     >     >>        ---
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >>
>     >     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>
>
>
