[ https://issues.apache.org/jira/browse/OPENNLP-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner updated OPENNLP-1307:
------------------------------------
    Fix Version/s: 2.1.1

> Incorrect code example for Document Categorization (9.3)
> --------------------------------------------------------
>
>                 Key: OPENNLP-1307
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1307
>             Project: OpenNLP
>          Issue Type: Documentation
>          Components: Doccat
>    Affects Versions: 1.9.3
>         Environment: N/A
>            Reporter: John Slocum
>            Assignee: Martin Wiesner
>            Priority: Major
>              Labels: DocumentCategorizerME, documentation
>             Fix For: 2.1.1
>
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> In
> [https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api],
> the code example passes a String to DocumentCategorizerME.categorize(). The
> method itself takes a String array. I flagged the priority as Major because
> this was a killer. Obviously it's a self-documenting bug when you run it, but
> I made the mistake of assuming that the array actually needed would be an
> array of documents. Instead it needs to be an array of tokens from a single
> document, i.e. one needs to split() the doc on whitespace. Lost 24 hours
> experimenting with algos (maxent vs. naive_bayes) and params (cutoff,
> iterations, etc.) before figuring this one out.
>
> Current (wrong) version:
>
> {code:java}
> String inputText = ...
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> double[] outcomes = myCategorizer.categorize(inputText);
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>
> Should be more like:
>
> {code:java}
> String inputText = ... // sanitized document to be categorized
> DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
> // categorize() expects the tokens of a single document, not the raw String
> double[] outcomes = myCategorizer.categorize(inputText.split(" "));
> String category = myCategorizer.getBestCategory(outcomes);
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)