[jira] Commented: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

Mark Harwood (JIRA) Mon, 14 Jan 2008 15:56:59 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558854#action_12558854
 ]


Mark Harwood commented on LUCENE-494:
-------------------------------------

I don't personally use this, but others may. My particular problem was easier 
to solve by adding stop words to my XSL query templates (I added support in 
the XMLQueryParser for the "FuzzyLikeThisQuery" tag to accept stop words). 
That choice was mostly about ease of configuration in my particular app.

I know Nutch has something similar implemented elsewhere - maybe in the query 
parser.

I also had the notion that wrapping IndexReader to auto-cache TermDocs for 
super-popular terms using a BitSet would be a good way to avoid the IO 
overhead. This BitSet wouldn't help resolve positional queries, e.g. 
phrase/span queries, which need a TermPositions implementation, but it would 
work for straight TermQueries.
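
To make the caching idea concrete, here is a minimal sketch in plain Java. It 
deliberately does not use the Lucene API: PostingsSource, PopularTermCache and 
the minFraction threshold are hypothetical names made up for illustration, 
standing in for a wrapped IndexReader and its TermDocs. It only stores the set 
of matching doc ids, so, as noted above, it carries no position information.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the part of an index reader that lists the docs matching a term. */
interface PostingsSource {
    int maxDoc();
    int[] docsFor(String term); // assumed to be the expensive, IO-bound call
}

/** Keeps one BitSet of matching doc ids for each term found to be "super-popular". */
class PopularTermCache {
    private final PostingsSource source;
    private final double minFraction; // assumed cut-off: fraction of docs a term must match to be cached
    private final Map<String, BitSet> cache = new HashMap<>();

    PopularTermCache(PostingsSource source, double minFraction) {
        this.source = source;
        this.minFraction = minFraction;
    }

    /** Returns the matching doc ids; callers should treat the result as read-only. */
    BitSet docs(String term) {
        BitSet cached = cache.get(term);
        if (cached != null) {
            return cached; // popular term seen before: no read from the source
        }
        int[] docIds = source.docsFor(term);
        BitSet bits = new BitSet(source.maxDoc());
        for (int docId : docIds) {
            bits.set(docId);
        }
        // Only terms matching a large share of the index are worth the memory.
        if (docIds.length >= minFraction * source.maxDoc()) {
            cache.put(term, bits);
        }
        return bits;
    }

    boolean isCached(String term) {
        return cache.containsKey(term);
    }
}
```

A BitSet costs roughly maxDoc/8 bytes per cached term however many docs 
match, which is why the sketch only caches terms above the threshold.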



> Analyzer for preventing overload of search service by queries with common 
> terms in large indexes
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: QueryAutoStopWordAnalyzer.java, 
> QueryAutoStopWordAnalyzerTest.java
>
>
> An analyzer used primarily at query time to wrap another analyzer and provide 
> a layer of protection which prevents very common words from being passed into 
> queries. For very large indexes the cost of reading TermDocs for a very common 
> word can be high. This analyzer was created after experience with a 38 million 
> doc index which had a term in around 50% of docs and was causing TermQueries 
> for this term to take 2 seconds.
> Use the various "addStopWords" methods in this class to automate the 
> identification and addition of stop words found in an already existing index.
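
The description above amounts to two steps: pick out terms whose document 
frequency is too high, then drop them from query terms. A minimal sketch in 
plain Java follows. It is not the attached QueryAutoStopWordAnalyzer: the 
class AutoStopWords, its method names and the maxPercentDocs parameter are 
hypothetical, and the document frequencies are assumed to have already been 
read from the index into a map.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; all names here are illustrative, not taken from the attachment. */
class AutoStopWords {

    /** Returns the terms that occur in more than maxPercentDocs (0..1) of numDocs documents. */
    static Set<String> findStopWords(Map<String, Integer> docFreqs, int numDocs, float maxPercentDocs) {
        int maxDocFreq = (int) (maxPercentDocs * numDocs);
        Set<String> stopWords = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    /** Drops the stop words from already-analyzed query terms, keeping their order. */
    static List<String> filterQueryTerms(List<String> terms, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!stopWords.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With a 38 million doc index and a 40% cut-off, a term present in about half 
of the docs would be flagged and never reach a TermQuery.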

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


