[jira] [Updated] (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage

JIRA Thu, 28 Feb 2013 05:05:30 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jan Høydahl updated LUCENE-602:
-------------------------------


This issue has been inactive for more than 4 years. Please close if it's no 
longer relevant/needed, or bring it up to date if you intend to work on it. 
SPRING_CLEANING_2013
                
> [PATCH] Filtering tokens for position and term vector storage
> -------------------------------------------------------------
>
>                 Key: LUCENE-602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-602
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 2.1
>            Reporter: Chuck Williams
>            Priority: Minor
>         Attachments: TokenSelectorAllWithParallelWriter.patch, 
> TokenSelectorSoloAll.patch
>
>
> This patch provides a new TokenSelector mechanism to select tokens of 
> interest and creates two new IndexWriter configuration parameters:  
> termVectorTokenSelector and positionsTokenSelector.
> termVectorTokenSelector, if non-null, selects which index tokens will be 
> stored in term vectors.  If positionsTokenSelector is non-null, then any 
> tokens it rejects will have only their first position stored in each document 
> (one position must still be stored so that the doc freq stays correct and 
> the token is not garbage collected in merges).
> This mechanism provides a simple solution to the problem of minimizing index 
> size overhead caused by storing extra tokens that facilitate queries, in those 
> cases where the mere existence of the extra tokens is sufficient.  For 
> example, in my test data using reverse tokens to speed prefix wildcard 
> matching, I obtained the following index overheads:
>   1.  With no TokenSelectors:  60% larger with reverse tokens than without
>   2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
>   3.  With both positionsTokenSelector and termVectorTokenSelector rejecting 
> reverse tokens:  25% larger
> It is possible to obtain the same effect by using a separate field that has 
> one occurrence of each reverse token and no term vectors, but this can be 
> hard or impossible to do, and it can hurt performance because it requires 
> either rereading the content or storing all the tokens for subsequent 
> processing.
> The solution with TokenSelectors is very easy to use and fast.
> Otis, thanks for leaving a comment in QueryParser.jj with the correct 
> production to enable prefix wildcards!  With this, it is a straightforward 
> matter to override the wildcard query factory method and use reverse tokens 
> effectively.
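
The reverse-token technique mentioned in the description can be sketched roughly as follows. This is an illustration only, not code from the attached patches: the class name, the marker character, and all method names are assumptions, and a sorted set stands in for the index's term dictionary. The idea is that a leading-wildcard pattern such as `*ing` becomes a cheap prefix scan over reversed tokens (`gni...`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class ReverseTokenSketch {
    // Hypothetical marker that keeps reversed tokens apart from normal tokens.
    static final char MARK = '\u0001';

    // Builds the extra token indexed alongside the original term.
    public static String reverseToken(String term) {
        return MARK + new StringBuilder(term).reverse().toString();
    }

    // Turns a pattern with a single leading '*' into a prefix over reversed tokens.
    public static String toReversedPrefix(String pattern) {
        if (!pattern.startsWith("*") || pattern.indexOf('*', 1) >= 0
                || pattern.indexOf('?') >= 0) {
            throw new IllegalArgumentException("expected a single leading '*': " + pattern);
        }
        return reverseToken(pattern.substring(1));
    }

    // Scans the sorted term dictionary from the prefix and stops at the first
    // term that no longer starts with it; returns the matching original terms.
    public static List<String> match(SortedSet<String> dictionary, String pattern) {
        String prefix = toReversedPrefix(pattern);
        List<String> hits = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            // Drop the marker and undo the reversal to recover the original term.
            hits.add(new StringBuilder(term.substring(1)).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>();
        for (String term : new String[] {"indexing", "merging", "token"}) {
            dictionary.add(term);
            dictionary.add(reverseToken(term));
        }
        System.out.println(match(dictionary, "*ing"));
    }
}
```

Only the existence of each reversed token matters for this lookup, which is why the description argues that its positions and term vector entries are pure overhead.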

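The selector behaviour described in the issue can be modelled in a few lines. The sketch below is a simplified stand-in, not the patch's API: the `TokenSelector` interface shape, the `accept` method, and the `rev_` prefix for reverse tokens are all hypothetical, and plain collections replace the real index structures. It shows the two effects the description names: rejected tokens keep only their first position per document, and rejected tokens are left out of the term vector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenSelectorSketch {
    /** Hypothetical interface; the signature used by the patch is not shown here. */
    public interface TokenSelector {
        boolean accept(String term);
    }

    // Positions stored for one document. A null selector stores every position;
    // a rejected term keeps only its first position so it still counts as
    // occurring in the document.
    public static Map<String, List<Integer>> storedPositions(
            List<String> tokens, TokenSelector positionsTokenSelector) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            String term = tokens.get(pos);
            List<Integer> positions = postings.get(term);
            if (positions == null) {
                positions = new ArrayList<>();
                postings.put(term, positions);
            }
            boolean keepAll = positionsTokenSelector == null
                    || positionsTokenSelector.accept(term);
            if (keepAll || positions.isEmpty()) {
                positions.add(pos);
            }
        }
        return postings;
    }

    // Distinct terms that would go into the document's term vector.
    // A null selector keeps every term.
    public static List<String> termVectorTerms(
            List<String> tokens, TokenSelector termVectorTokenSelector) {
        List<String> terms = new ArrayList<>();
        for (String term : tokens) {
            boolean keep = termVectorTokenSelector == null
                    || termVectorTokenSelector.accept(term);
            if (keep && !terms.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

With a selector that rejects terms starting with `rev_`, a document tokenized as `cat, rev_tac, cat, rev_tac` would store positions 0 and 2 for `cat`, only position 1 for `rev_tac`, and a term vector containing just `cat`.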
