[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rowe updated LUCENE-2899:
-------------------------------
    Attachment: LUCENE-2899.patch

Attaching another WIP patch with more progress:

* Switched {{OpenNLPFilter}} to use {{TypeAttribute}} instead of {{PayloadAttribute}} to hold annotations from part-of-speech tagging, chunking and NER tagging.
* Added a new {{TypeAsSynonymFilter}} to the analyzers-common module. It adds a token at the same position as a (presumably previously annotated) token, with the value of the {{TypeAttribute}} copied into its {{CharTermAttribute}}. See [~sbower]'s comment above for potential uses.
* Removed the now unnecessary {{FilterPayloadsFilter}} and {{StripPayloadFilter}} that were present in previous iterations of the patch.
* Added {{KeywordAttribute}} awareness to {{OpenNLPLemmatizationFilter}}, so that lemmatization won't be performed on tokens with {{isKeyword()==true}}.
* Fixed the new payload-aware {{BaseTokenStreamTestCase.assertTokenStreamContents()}} to use {{BytesRef.equals()}} instead of comparing the {{byte}} arrays directly, which ignored offset and length.
* Added {{TypeAttribute}} awareness to {{CannedTokenStream}}.

> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch,
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch,
> OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice
> to have a submodule (under analysis) that exposed capabilities for it. Drew
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
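
For readers unfamiliar with the "type as synonym" idea described for {{TypeAsSynonymFilter}} above, here is a rough, Lucene-free sketch of the behavior. It is not the code in the patch: the {{Token}} class, its fields, and the {{addTypeSynonyms}} method are hypothetical stand-ins for {{CharTermAttribute}}, {{TypeAttribute}} and the position increment. It also assumes that every token gets a synonym and that the injected token keeps the original token's type, neither of which the message above states.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeAsSynonymSketch {

    /** Hypothetical stand-in for a token's term text, type, and position increment. */
    static final class Token {
        final String term;
        final String type;
        final int positionIncrement;

        Token(String term, String type, int positionIncrement) {
            this.term = term;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * For each input token, emits the token unchanged and then a second token
     * whose term is the original token's type. A position increment of 0
     * places the second token at the same position as the original.
     */
    static List<Token> addTypeSynonyms(List<Token> input) {
        List<Token> output = new ArrayList<>();
        for (Token token : input) {
            output.add(token);
            // Assumption: the injected token keeps the original type.
            output.add(new Token(token.type, token.type, 0));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tagged = new ArrayList<>();
        tagged.add(new Token("Lucene", "NNP", 1));
        tagged.add(new Token("rocks", "VBZ", 1));
        for (Token t : addTypeSynonyms(tagged)) {
            System.out.println(t.term + " type=" + t.type + " posInc=" + t.positionIncrement);
        }
    }
}
```

With the type indexed at the same position as the term, a query could in principle match on a part-of-speech or entity tag where it would otherwise need the literal term.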