[ https://issues.apache.org/jira/browse/LUCENE-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603864#comment-14603864 ]

Hoss Man commented on LUCENE-6624:
----------------------------------

(creating in response to a conversation i had earlier today that made me 
realize we still don't offer anything out of the box to really address this 
type of problem ... probably won't have time to get to it soon, but i wanted to 
file the issue with the overall gist of the goal/idea so it's actually written 
down somewhere)

> provide a BookendFilter to make the "exact match against an entire 
> (tokenized) field value" use case easy
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6624
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6624
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> A question that seems to pop up every now and then is how to require an 
> "exact match" against "an entire field value" even while using some sort of 
> analysis feature (ie: stopwords, or lowercasing, or whitespace normalization).
> In other words: instead of a literal, byte for byte, "exact match" (eg: {{new 
> StringField(f, val, Store.NO)}} at index time; {{new TermQuery(new Term(f, 
> val))}} at query time) some folks want to use some Tokenizer and TokenFilter 
> but then require that a "PhraseQuery" (or SpanNearQuery) on the input matches 
> the entire field value, w/o any terms left over.
> Example: they want (phrase) queries like {{"The Quick Brown Dog"}} and 
> {{"quick BROWN dog"}} to both match a document indexed with a field value 
> "{{The Quick Brown Dog.}}" because their analyzer tokenizes both the query & 
> the field value into {{quick | brown | dog}} (standard tokenizer + stopword & 
> lowercase filters) -- BUT -- on the other hand they don't want either of 
> those phrase queries to match a document with a field value of "{{I Love the 
> Quick Brown Dog}}" because that field value includes additional terms not 
> covered by the query.
> A suggestion i've seen for years in response to this type of question is that 
> folks can "inject marker tokens" at the beginning and end of both the field 
> values & query, and then (as long as there is no "slop" on the phrase 
> queries) they should get the matches they expect.  The hackish way to do this 
> is to just prepend and append some strings that won't be found in their 
> data and won't be stripped out by their tokenizer or any token filters (eg: 
> {{new TextField(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC", Store.NO)}} 
> at index time; {{queryBuilder.createPhraseQuery(f, "VAL_START_XYZABC " + val 
> + " VAL_END_XYZABC")}} at query time).
> Unless i'm missing something, it should be fairly trivial to write a 
> "BookendFilter" that does this automatically for users (see the sketch 
> after this list):
> * the first time {{incrementToken()}} is called, produce a synthetic "start" 
> token whose CharTermAttribute uses a non-printing unicode sequence 
> (overridable by user config)
> * after that, all calls to {{incrementToken()}} proxy to the wrapped stream 
> until it's exhausted
> * after that, when {{incrementToken()}} is called, produce a synthetic "end" 
> token whose CharTermAttribute uses a non-printing unicode sequence 
> (overridable by user config)
> * both synthetic tokens should have KeywordAttribute == true
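> Something like this (a minimal, untested sketch -- the default marker 
> strings here are arbitrary non-printing characters, and real code would 
> want them user-configurable):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
>
> public final class BookendFilter extends TokenFilter {
>   // arbitrary non-printing defaults, overridable via the constructor
>   public static final String DEFAULT_START = "\u0001";
>   public static final String DEFAULT_END = "\u0002";
>
>   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>   private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
>   private final String startToken;
>   private final String endToken;
>   private boolean emittedStart;
>   private boolean emittedEnd;
>
>   public BookendFilter(TokenStream in) {
>     this(in, DEFAULT_START, DEFAULT_END);
>   }
>
>   public BookendFilter(TokenStream in, String startToken, String endToken) {
>     super(in);
>     this.startToken = startToken;
>     this.endToken = endToken;
>   }
>
>   @Override
>   public boolean incrementToken() throws IOException {
>     if (!emittedStart) {
>       // first call: emit the synthetic "start" token
>       emittedStart = true;
>       clearAttributes();
>       termAtt.setEmpty().append(startToken);
>       keywordAtt.setKeyword(true); // shield from downstream stemmers etc.
>       return true;
>     }
>     if (input.incrementToken()) {
>       return true; // proxy the wrapped stream until it's exhausted
>     }
>     if (!emittedEnd) {
>       // wrapped stream exhausted: emit the synthetic "end" token
>       emittedEnd = true;
>       clearAttributes();
>       termAtt.setEmpty().append(endToken);
>       keywordAtt.setKeyword(true);
>       return true;
>     }
>     return false;
>   }
>
>   @Override
>   public void reset() throws IOException {
>     super.reset();
>     emittedStart = false;
>     emittedEnd = false;
>   }
> }
> {code}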
> ...At index time the synthetic tokens will be indexed as terms, and if the 
> same analyzer is used at query time to build a PhraseQuery those terms will 
> be the first and last terms in the PhraseQuery.
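> For example, bookending an otherwise ordinary chain (the exact 
> tokenizer & filters here are just illustrative):
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.util.QueryBuilder;
>
> Analyzer analyzer = new Analyzer() {
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
>     Tokenizer source = new StandardTokenizer();
>     TokenStream sink = new LowerCaseFilter(source);
>     sink = new StopFilter(sink, StandardAnalyzer.STOP_WORDS_SET);
>     // bookends go last, so no downstream filter can strip or alter them
>     sink = new BookendFilter(sink);
>     return new TokenStreamComponents(source, sink);
>   }
> };
>
> // "The Quick Brown Dog." indexes as: <start> quick brown dog <end>
> // and the same analyzer at query time puts the bookend terms at the
> // edges of the generated PhraseQuery
> Query q = new QueryBuilder(analyzer).createPhraseQuery("title", "quick BROWN dog");
> {code}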


