[PATCH] Filtering tokens for position and term vector storage
-------------------------------------------------------------

         Key: LUCENE-602
         URL: http://issues.apache.org/jira/browse/LUCENE-602
     Project: Lucene - Java
        Type: New Feature

  Components: Index  
    Versions: 2.1    
    Reporter: Chuck Williams
 Attachments: TokenSelectorSoloAll.patch

This patch provides a new TokenSelector mechanism to select tokens of interest 
and creates two new IndexWriter configuration parameters:  
termVectorTokenSelector and positionsTokenSelector.

termVectorTokenSelector, if non-null, selects which index tokens will be stored 
in term vectors.  If positionsTokenSelector is non-null, then any tokens it 
rejects will have only their first position stored in each document (at least 
one position must be stored so that the doc freq stays correct and the token 
is not garbage collected in merges).
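
The selection rule described above can be illustrated with a minimal sketch. 
It does not use the patch's actual API: the TokenSelector signature, the 
REVERSE_MARKER convention for recognizing reverse tokens, and the 
positionsToStore helper are all assumptions made for illustration, and 
positions are simplified to plain list indices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TokenSelectorSketch {

    /** Hypothetical selector interface; the patch's real signature may differ. */
    interface TokenSelector {
        boolean accept(String field, String termText);
    }

    /** Assumed convention: reverse tokens start with this marker character. */
    static final char REVERSE_MARKER = '\u0001';

    /** Rejects reverse tokens and accepts every other token. */
    static final TokenSelector REJECT_REVERSE =
        (field, termText) -> termText.isEmpty() || termText.charAt(0) != REVERSE_MARKER;

    /**
     * Returns the positions that would be stored for each term of one document.
     * Accepted terms keep every position; rejected terms keep only the first,
     * so the term still counts toward the doc freq. A null selector keeps all.
     */
    static Map<String, List<Integer>> positionsToStore(
            String field, List<String> tokens, TokenSelector positionsSelector) {
        Map<String, List<Integer>> stored = new LinkedHashMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            String term = tokens.get(pos);
            List<Integer> positions = stored.computeIfAbsent(term, t -> new ArrayList<>());
            boolean keepAll = positionsSelector == null
                || positionsSelector.accept(field, term);
            if (keepAll || positions.isEmpty()) {
                positions.add(pos);
            }
        }
        return stored;
    }
}
```

A selector for term vectors would follow the same pattern, except that a 
rejected token would be left out of the term vector entirely rather than 
reduced to one position.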

This mechanism provides a simple solution to the problem of minimizing the 
index size overhead caused by storing extra tokens that facilitate queries, in 
those cases where the mere existence of the extra tokens is sufficient.  For 
example, in my test data using reverse tokens to speed prefix wildcard 
matching, I obtained the following index overheads:

  1.  With no TokenSelectors:  60% larger with reverse tokens than without
  2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
  3.  With both positionsTokenSelector and termVectorTokenSelector rejecting 
      reverse tokens:  25% larger

It is possible to obtain the same effect by using a separate field that has one 
occurrence of each reverse token and no term vectors, but this can be hard or 
impossible to do, and it is a performance problem because it requires either 
rereading the content or storing all the tokens for subsequent processing.

The solution with TokenSelectors is very easy to use and fast.

Otis, thanks for leaving a comment in QueryParser.jj with the correct 
production to enable prefix wildcards!  With this, it is a straightforward 
matter to override the wildcard query factory method and use reverse tokens 
effectively.
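
To show why reverse tokens help here, the following is a minimal sketch of 
rewriting a pattern such as "*ing" into a prefix scan over reverse tokens.  It 
is independent of Lucene: the sorted TreeSet stands in for the term 
dictionary, and REVERSE_MARKER, reverseToken, toReversePrefix and matchSuffix 
are hypothetical names, not part of the patch or of QueryParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ReverseTokenSketch {

    /** Assumed convention: reverse tokens start with this marker character. */
    static final char REVERSE_MARKER = '\u0001';

    /** Builds the extra token indexed alongside a term, e.g. "index" -> marker + "xedni". */
    static String reverseToken(String term) {
        return REVERSE_MARKER + new StringBuilder(term).reverse().toString();
    }

    /** Turns a pattern of the form "*suffix" into a prefix over reverse tokens. */
    static String toReversePrefix(String pattern) {
        boolean simple = pattern.startsWith("*")
            && pattern.indexOf('*', 1) < 0
            && pattern.indexOf('?') < 0;
        if (!simple) {
            throw new IllegalArgumentException("expected a single leading '*': " + pattern);
        }
        return reverseToken(pattern.substring(1));
    }

    /** Simulates a prefix scan over a sorted term dictionary and returns the forward terms. */
    static List<String> matchSuffix(TreeSet<String> termDictionary, String pattern) {
        String prefix = toReversePrefix(pattern);
        List<String> matches = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            // Drop the marker and undo the reversal to recover the original term.
            matches.add(new StringBuilder(term.substring(1)).reverse().toString());
        }
        return matches;
    }
}
```

The scan touches only the terms that share the reversed prefix instead of 
enumerating the whole dictionary.  In a real setup this rewrite would sit in 
the overridden wildcard query factory method mentioned above.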


