[jira] Commented: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

Karl Wettin (JIRA) Sat, 31 May 2008 03:38:17 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601373#action_12601373 ]


Karl Wettin commented on LUCENE-725:
------------------------------------

If you hang on for a week I too will be taking a closer look at this code.

http://www.nabble.com/Clustering-Demo-tt17127240.html#a17449440


> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all 
> "boilerplate" text
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-725
>                 URL: https://issues.apache.org/jira/browse/LUCENE-725
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Mark Harwood
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: NovelAnalyzer.java, NovelAnalyzer.java
>
>
> This is a class I have found to be useful for analyzing small (in the 
> hundreds) collections of documents and removing any duplicate content such 
> as standard disclaimers or repeated text in an exchange of emails.
> This has applications in sampling query results to identify key phrases, 
> improving speed-reading of results with similar content (e.g. email 
> threads/forum messages) or just removing duplicated noise from a search index.
> To be more generally useful it needs to scale to millions of documents - in 
> which case an alternative implementation is required. See the notes in the 
> Javadocs for this class for more discussion on this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


