[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290015#comment-13290015
 ] 

Lance Norskog commented on LUCENE-2899:
---------------------------------------

Notes for a Wiki page:

OpenNLP Integration

What is the integration? The first integration is a Tokenizer and three 
Filters. 
* The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools 
instead of the standard Lucene Tokenizers. These tools require statistical 
model files. One quirk of these tools is that they preserve all punctuation. 
* The OpenNLPFilter implements Part-of-Speech tagging, Chunking (finding 
noun/verb phrases), and Named Entity Recognition (tagging people, place names, 
etc.). This filter adds all tags as payload attributes to the tokens.
* The FilterPayloadsFilter removes tokens by checking the payloads. Given a 
list of payloads, it will either keep only tokens with one of those payloads, 
or remove only matching tokens and keep the rest. (This filter maintains 
position increments correctly.)
* The StripPayloadsFilter removes payloads from Tokens. 
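
To make the keep/remove behavior of the FilterPayloadsFilter concrete, here is 
a minimal sketch in plain Java. It is not the code from the patch and it does 
not use the Lucene TokenStream API: the PayloadFilterSketch class, its Token 
class and the filter() method are hypothetical names made up for illustration, 
and payloads are modeled as Strings rather than bytes. It only shows the idea 
of folding the position increments of dropped tokens into the next kept token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PayloadFilterSketch {

    /** Simplified stand-in for a token: term text, one payload tag, position increment. */
    public static final class Token {
        public final String term;
        public final String payload;
        public final int positionIncrement;

        public Token(String term, String payload, int positionIncrement) {
            this.term = term;
            this.payload = payload;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * If keepMatching is true, keeps only tokens whose payload is in the set;
     * otherwise removes exactly those tokens and keeps the rest.
     */
    public static List<Token> filter(List<Token> input, Set<String> payloads,
                                     boolean keepMatching) {
        List<Token> output = new ArrayList<>();
        int skipped = 0; // increments of tokens dropped since the last kept token
        for (Token t : input) {
            boolean matches = payloads.contains(t.payload);
            if (matches == keepMatching) {
                // Add the skipped positions so the gap left by dropped tokens is preserved.
                output.add(new Token(t.term, t.payload, t.positionIncrement + skipped));
                skipped = 0;
            } else {
                skipped += t.positionIncrement;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("The", "DT", 1),
            new Token("quick", "JJ", 1),
            new Token("fox", "NN", 1),
            new Token("jumps", "VBZ", 1));
        Set<String> tags = new HashSet<>(Arrays.asList("NN", "VBZ"));
        for (Token t : filter(tokens, tags, true)) {
            System.out.println(t.term + " payload=" + t.payload
                + " posInc=" + t.positionIncrement);
        }
    }
}
```

In keep mode the first surviving token carries the increments of the tokens 
dropped before it, so phrase and proximity queries still see the original 
distances between the remaining tokens.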

How do I get going?
* pull the latest trunk
* apply the patch
* download these models to 
contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/
** [http://opennlp.sourceforge.net/models-1.5/]
** Everything that starts with 'en'
* download the OpenNLP distribution from 
[http://opennlp.apache.org/cgi-bin/download.cgi]
** Currently it is apache-opennlp-1.5.2-incubating-bin.tar.gz
* unpack this and copy the jar files from lib/ to
solr/contrib/opennlp/lib

Now, go to trunk-dir/solr and run 'ant test-contrib'. This compiles against the 
libraries and runs the tests using the model files. 
Next, run 'ant example', cd to the example directory, and run 
'java -Dsolr.solr.home=opennlp -jar start.jar'. 
Solr should now start without any exceptions. At this point, go to the Schema 
analyzer page, pick the 'text_opennlp_pos' field type, and post a sentence or 
two to the analyzer. You should get text tokenized with payloads. 
Unfortunately, the analysis page shows the payloads as bytes instead of text. 
If you would like this changed, go vote on [SOLR-3493].


                
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
