[jira] [Commented] (OPENNLP-777) Naive Bayesian Classifier

Cohan Sujay Carlos (JIRA) Fri, 18 Sep 2015 01:07:07 -0700

    [ https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805176#comment-14805176 ]


Cohan Sujay Carlos commented on OPENNLP-777:
--------------------------------------------

[~joern] and [~teofili],

There is another problem with the DocumentCategorizer, and that is in the 
nomenclature.

DocumentCategorizer is just the interface and there is no concrete 
implementation thereof at present.

So, if you look at the tutorials available on OpenNLP, including the 1.6.0 
manual
(https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.doccat.classifying.api),
you will see that the sample code tends to use DocumentCategorizerME explicitly.

The ME suffix seems to indicate Maximum Entropy.

So, wouldn't it be confusing for a user to instantiate a class whose name says 
Maximum Entropy but which, owing to the setting of parameters, uses a Naive 
Bayes algorithm internally instead?

The 1.6.0 manual actually says:

{quote}
Document Categorizer API

To perform classification you will need a *maxent* model - these are 
encapsulated in the DoccatModel class of OpenNLP tools.

First you need to grab the bytes from the serialized model on an InputStream - 
we'll leave it you to do that, since you were the one who serialized it to 
begin with. Now for the easy part:
{quote}

And the code goes:

{code}
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
{code}

Wouldn't this necessitate the use of a different concrete subclass (i.e., 
DocumentCategorizerNB) to preserve backward compatibility? Users have already 
written code against DocumentCategorizerME, which makes renaming that concrete 
class inadvisable.
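
To make the backward-compatibility point concrete, here is a minimal sketch of 
the naming scheme I have in mind. It does not use the OpenNLP classes: 
NamingSketch, Categorizer, CategorizerME and CategorizerNB are hypothetical 
stand-ins, and the scores are hard-coded placeholders rather than real models.

```java
public class NamingSketch {

    /** Hypothetical stand-in for the DocumentCategorizer interface. */
    interface Categorizer {
        double[] categorize(String text);
        String getBestCategory(double[] outcomes);
    }

    /** Shared category lookup, so the two variants differ only in scoring. */
    static abstract class Base implements Categorizer {
        private final String[] categories;

        Base(String... categories) {
            this.categories = categories;
        }

        @Override
        public String getBestCategory(double[] outcomes) {
            int best = 0;
            for (int i = 1; i < outcomes.length; i++) {
                if (outcomes[i] > outcomes[best]) {
                    best = i;
                }
            }
            return categories[best];
        }
    }

    /** Keeps its existing name, so code already written against it still compiles. */
    static class CategorizerME extends Base {
        CategorizerME(String... categories) {
            super(categories);
        }

        @Override
        public double[] categorize(String text) {
            return new double[] {0.7, 0.3}; // placeholder scores, not a maxent model
        }
    }

    /** Separate, explicitly named Naive Bayes variant. */
    static class CategorizerNB extends Base {
        CategorizerNB(String... categories) {
            super(categories);
        }

        @Override
        public double[] categorize(String text) {
            return new double[] {0.4, 0.6}; // placeholder scores, not a Naive Bayes model
        }
    }

    public static void main(String[] args) {
        // Existing user code: typed against the ME class, left untouched.
        CategorizerME legacy = new CategorizerME("sports", "politics");
        System.out.println(legacy.getBestCategory(legacy.categorize("some text")));

        // New code: picks the algorithm by class name, or codes to the interface.
        Categorizer nb = new CategorizerNB("sports", "politics");
        System.out.println(nb.getBestCategory(nb.categorize("some text")));
    }
}
```

With this layout the class name always matches the algorithm that runs, and 
code that only needs a categorizer can depend on the interface alone.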

> Naive Bayesian Classifier
> -------------------------
>
>                 Key: OPENNLP-777
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-777
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Machine Learning
>         Environment: J2SE 1.5 and above
>            Reporter: Cohan Sujay Carlos
>            Assignee: Tommaso Teofili
>            Priority: Minor
>              Labels: NBClassifier, bayes, bayesian, classifier, multinomial, 
> naive, patch
>         Attachments: D1TopicClassifierTrainingDemoNB.java, 
> D1TopicClassifierUsageDemoNB.java, NaiveBayesCorrectnessTest.java, 
> naive-bayesian-classifier-for-opennlp-1.6.0-rc6-with-test-cases.patch, 
> prep-attach-test-case-for-naive-bayesian-classifier-for-opennlp-1.6.0-rc6.patch,
>  topics.train
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> I thought it would be nice to have a Naive Bayesian classifier in OpenNLP (it 
> lacks one at present).
> Implementation details:  We have a production-hardened piece of Java code for 
> a multinomial Naive Bayesian classifier (with default Laplace smoothing) that 
> we'd like to contribute.  The code is Java 1.5 compatible.  I'd have to write 
> an adapter to make the interface compatible with the ME classifier in 
> OpenNLP.  I expect the patch to be available 1 to 3 weeks from now.
> Below is the email trail of a discussion in the dev mailing list around this 
> dated May 19th, 2015.
> <snip>
> Tommaso Teofili via opennlp.apache.org 
> to dev 
> Hi Cohan,
> I think that'd be a very valuable contribution, as NB is one of the
> foundation algorithms, often used as basis for comparisons.
> It would be good if you could create a Jira issue and provide more details
> about the implementation and, eventually, a patch.
> Thanks and regards,
> Tommaso
> </snip>
> 2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos 
> > I have a question for the OpenNLP project team.
> >
> > I was wondering if there is a Naive Bayesian classifier implementation in
> > OpenNLP that I've not come across, or if there are plans to implement one.
> >
> > If it is the latter, I should love to contribute an implementation.
> >
> > There is an ME classifier already available in OpenNLP, of course, but I
> > felt that there was an unmet need for a Naive Bayesian (NB) classifier
> > implementation to be offered as well.
> >
> > An NB classifier could be bootstrapped up with partially labelled training
> > data as explained in the Nigam, McCallum, et al paper of 2000 "Text
> > Classification from Labeled and Unlabeled Documents using EM".
> >
> > So, if there isn't an NB code base out there already, I'd be happy to
> > contribute a very solid implementation that we've used in production for a
> > good 5 years.
> >
> > I'd have to adapt it to load the same training data format as the ME
> > classifier, but I guess that shouldn't be very difficult to do.
> >
> > I was wondering if there was some interest in adding an NB implementation
> > and, if there is, I'd love to know who I could coordinate with.
> >
> > Cohan Sujay Carlos
> > CEO, Aiaioo Labs, India
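
For readers unfamiliar with the technique named in the issue description 
above, the following is a rough sketch of a multinomial Naive Bayes classifier 
with Laplace (add-one) smoothing. It is not the contributed patch: the class 
MultinomialNBSketch and its methods are hypothetical, it works on 
pre-tokenized input, and it uses Java 8 features for brevity, whereas the 
contributed code is described as Java 1.5 compatible.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MultinomialNBSketch {
    private final Map<String, Integer> docCounts = new HashMap<>();   // documents per category
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>(); // token counts per category
    private final Map<String, Integer> totalTokens = new HashMap<>(); // total tokens per category
    private final Set<String> vocabulary = new HashSet<>();           // tokens seen in training
    private int totalDocs = 0;

    /** Adds one labelled, pre-tokenized document to the counts. */
    public void train(String category, String[] tokens) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalTokens.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns log P(category) + sum of log P(token | category), add-one smoothed. */
    public double logScore(String category, String[] tokens) {
        double score = Math.log((double) docCounts.get(category) / totalDocs);
        Map<String, Integer> counts = tokenCounts.get(category);
        int denominator = totalTokens.getOrDefault(category, 0) + vocabulary.size();
        for (String token : tokens) {
            // A token never seen with this category still gets a count of 1.
            int count = counts.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / denominator);
        }
        return score;
    }

    /** Returns the category with the highest log score, or null if untrained. */
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = logScore(category, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MultinomialNBSketch nb = new MultinomialNBSketch();
        nb.train("sports", new String[] {"goal", "match", "team"});
        nb.train("politics", new String[] {"vote", "election", "party"});
        System.out.println(nb.classify(new String[] {"team", "goal"}));       // sports
        System.out.println(nb.classify(new String[] {"election", "unseen"})); // politics
    }
}
```

Scores are summed in log space to avoid underflow on long documents, and the 
smoothing keeps a single unseen token from driving a category's probability 
to zero.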



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
