[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

Benson Margulies (JIRA) Tue, 12 Nov 2013 03:36:53 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820031#comment-13820031 ]


Benson Margulies commented on LUCENE-2899:
------------------------------------------

I know of an NER model that looks at the entire text to bias towards consistent 
tagging of entities in larger units. However, I agree that crocks are bad. 
Perhaps this is an opportunity to think about how to expand the analysis 
protocol to support this sort of thing more smoothly?

It would be desirable for this integration to start with a set of Token 
Attributes that any analysis component, inside or outside of Lucene, could use 
if it is in a position to deliver similar items. I suppose I'm late to ask for 
this, as the UIMA component must pose the same question.

In some languages, NER is very clumsy as a token filter, because entities don't 
obey token boundaries very well. Also, in my experience, entities aren't useful 
as additional tokens in the same field as their source text, but rather in 
their own field (where they can be faceted upon, for example). Is there any 
appetite to look at Lucene support for a stream that delivers to more than one 
field? Or is there such a thing and I've missed it?
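
To make the separate-field idea concrete, here is a minimal sketch in plain 
Java. It is not Lucene API: `EntityFieldRouter`, `EntitySpan`, and the field 
names "body" and "entities" are all hypothetical, and whitespace splitting 
stands in for a real tokenizer. It only illustrates entities being cut by 
character offset (so they need not line up with tokens) and landing in their 
own field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Lucene or OpenNLP APIs.
final class EntityFieldRouter {

    /** An entity mention as character offsets into the source text, plus a type label. */
    static final class EntitySpan {
        final int start;   // inclusive character offset
        final int end;     // exclusive character offset
        final String type; // e.g. "LOCATION"

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Splits one analysis result into two fields: whitespace tokens of the
     * text go to "body", and each entity's surface string goes to "entities".
     * Entities are taken by character offset, independent of token boundaries.
     */
    static Map<String, List<String>> route(String text, List<EntitySpan> spans) {
        List<String> body = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                body.add(token);
            }
        }

        List<String> entities = new ArrayList<>();
        for (EntitySpan span : spans) {
            entities.add(span.type + ":" + text.substring(span.start, span.end));
        }

        // LinkedHashMap keeps the field order stable for inspection.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body", body);
        fields.put("entities", entities);
        return fields;
    }
}
```

With the text "Visit New York-based Acme today" and a LOCATION span over 
offsets 6 to 14, the entity "New York" ends in the middle of the whitespace 
token "York-based", which is the kind of mismatch that makes a token filter 
awkward here.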

I agree with Rob about UIMA because I think that Lucene analysis attributes are 
a weak data model for interconnecting NLP modules and flowing data through them 
-- and one frequently needs to do that.



> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.6
>
>         Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.1#6144)

