[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog edited comment on LUCENE-4345 at 9/15/12 10:20 AM:
-----------------------------------------------------------------

I recently did some related research in text analysis and found that limiting 
terms to nouns and verbs gave a 10-15% improvement in all variations of the 
test.

So filtering terms by part-of-speech annotation should be very helpful. My 
OpenNLP patch includes a FilterPayloadsFilter, which keeps or removes terms 
based on a list of text payloads.
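
As a rough illustration of the keep-or-remove idea, here is a minimal 
plain-Java sketch. It is not the FilterPayloadsFilter from the patch (which 
works on token payloads inside a Lucene analysis chain). The class 
PosTermFilter, the TaggedTerm type, and the Penn Treebank-style tags are all 
assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosTermFilter {

    /** A term paired with its part-of-speech tag (hypothetical stand-in for a token plus its text payload). */
    public static final class TaggedTerm {
        public final String term;
        public final String tag;

        public TaggedTerm(String term, String tag) {
            this.term = term;
            this.tag = tag;
        }
    }

    /**
     * When keep is true, returns only the terms whose tag is in the set.
     * When keep is false, returns only the terms whose tag is NOT in the set.
     */
    public static List<String> filter(List<TaggedTerm> terms, Set<String> tags, boolean keep) {
        List<String> out = new ArrayList<>();
        for (TaggedTerm t : terms) {
            if (tags.contains(t.tag) == keep) {
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedTerm> sentence = Arrays.asList(
                new TaggedTerm("the", "DT"),
                new TaggedTerm("quick", "JJ"),
                new TaggedTerm("fox", "NN"),
                new TaggedTerm("jumps", "VBZ"));

        // Assumed tag list: a few noun and verb tags, not an exhaustive set.
        Set<String> nounsAndVerbs = new HashSet<>(Arrays.asList("NN", "NNS", "VB", "VBZ"));

        System.out.println(filter(sentence, nounsAndVerbs, true));  // keep nouns and verbs
        System.out.println(filter(sentence, nounsAndVerbs, false)); // remove nouns and verbs
    }
}
```

The single keep flag mirrors the "keeps or rips out" behaviour described 
above: the same tag list can serve as either a whitelist or a blacklist.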

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields, so these can be used as training examples (with features) to very 
> quickly create classifier algorithms to use on new documents and/or to 
> provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already-seen data (the indexed 
> documents/fields) to classify new documents or text fragments.
> The first version will contain a (simplistic) Lucene-based Naive Bayes 
> classifier, but more implementations should be added in the future.
