Hoss Man created LUCENE-6624:
--------------------------------

             Summary: provide a BookendFilter to make the "exact match against 
an entire (tokenized) field value" usecase easy
                 Key: LUCENE-6624
                 URL: https://issues.apache.org/jira/browse/LUCENE-6624
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Hoss Man


A question that seems to pop up every now and then is how to require an "exact 
match" against "an entire field value" even while using some sort of analysis 
feature (ie: stopwords, or lowercasing, or whitespace normalization).

In other words: instead of a literal, byte for byte, "exact match" (eg: {{new 
StringField(f, val, Store.NO)}} at index time; {{new TermQuery(new Term(f, 
val))}} at query time) some folks want to use some Tokenizer and TokenFilter 
but then require that a "PhraseQuery" (or SpanNearQuery) on the input matches 
the entire field value, w/o any terms left over.

Example: they want (phrase) queries like {{"The Quick Brown Dog"}} and 
{{"quick BROWN dog"}} to both match a document indexed with a field value 
"{{The Quick Brown Dog.}}" because their analyzer tokenizes both the query & 
the field value into {{quick | brown | dog}} (standard tokenizer + stopword & 
lowercase filters) -- BUT -- on the other hand they don't want either of those 
phrase queries to match a document with a field value of "{{I Love the Quick 
Brown Dog}}" because that field value includes additional terms not covered by 
the query.


A suggestion i've seen for years in response to this type of question is that 
folks can "inject marker tokens" at the beginning and end of both the field 
values & query, and then (as long as there is no "slop" on the phrase queries) 
they should get the matches they expect.  The hackish way to do this is to 
just prepend and append some strings that won't be found in their data and won't 
be stripped out by their tokenizer or any token filters (eg: {{new TextField(f, 
"VAL_START_XYZABC " + val + " VAL_END_XYZABC", Store.NO)}} at index time; 
{{queryBuilder.createPhraseQuery(f, "VAL_START_XYZABC " + val + " 
VAL_END_XYZABC")}} at query time).
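
To make the effect of the marker tokens concrete, here is a rough, Lucene-free sketch in plain Java. Everything in it is an assumption made for illustration: the {{MarkerTokenDemo}} class, the toy {{analyze()}} method (split on non-word characters, lowercase, drop a tiny hypothetical stopword list) and the {{phraseMatches()}} helper that stands in for a zero-slop phrase query. It also ignores position increments, so it does not model the gaps a real stopword filter can leave behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MarkerTokenDemo {
    // Hypothetical stopword list, only for this illustration.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // Toy analyzer: split on anything that is not a letter, digit or underscore,
    // lowercase each piece, and drop stopwords. Not a real Lucene Analyzer.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^A-Za-z0-9_]+")) {
            String t = raw.toLowerCase(Locale.ROOT);
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for a zero-slop phrase query: the query tokens must appear
    // as one contiguous run somewhere in the field tokens.
    static boolean phraseMatches(List<String> field, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(field, query) >= 0;
    }

    // The "hackish" approach: wrap the raw value in marker strings.
    static String bookend(String val) {
        return "VAL_START_XYZABC " + val + " VAL_END_XYZABC";
    }

    public static void main(String[] args) {
        String exactDoc = "The Quick Brown Dog.";
        String longerDoc = "I Love the Quick Brown Dog";
        String query = "quick BROWN dog";

        // Without markers the phrase matches both documents.
        System.out.println(phraseMatches(analyze(exactDoc), analyze(query)));
        System.out.println(phraseMatches(analyze(longerDoc), analyze(query)));

        // With markers on both sides only the whole-value document matches.
        System.out.println(phraseMatches(analyze(bookend(exactDoc)), analyze(bookend(query))));
        System.out.println(phraseMatches(analyze(bookend(longerDoc)), analyze(bookend(query))));
    }
}
```

In this model the markers survive analysis only because the toy tokenizer keeps underscores and the markers are not stopwords; with a real analysis chain the marker strings would need to be chosen with the same care.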


Unless i'm missing something, it should be fairly trivial to write a 
"BookendFilter" that does this automatically for users:

* the first time {{incrementToken()}} is called, produce a synthetic "start" 
token with some CharTermAttribute that uses a non-printing unicode sequence 
(overridable by user config)
* after that, all calls to {{incrementToken()}} proxy to the wrapped stream 
until it's exhausted
* after that, when {{incrementToken()}} is called, produce a synthetic "end" 
token with some CharTermAttribute that uses a non-printing unicode sequence 
(overridable by user config)
* both synthetic tokens should have KeywordAttribute == true

...At index time the synthetic tokens will be indexed as terms, and if the same 
analyzer is used at query time to build a PhraseQuery those terms will be the 
first and last terms in the PhraseQuery.
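
The three-step behaviour in the bullets above can be sketched as a small state machine. The following is only a plain-Java model of the idea, not a Lucene TokenFilter: the {{BookendIterator}} name, the use of {{Iterator<String>}} in place of a TokenStream, and the default marker characters (U+0002 and U+0003) are all assumptions chosen for illustration. A real filter would instead populate CharTermAttribute and set KeywordAttribute on the two synthetic tokens, which this model does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class BookendIterator implements Iterator<String> {
    // Assumed defaults: two non-printing control characters (STX and ETX).
    static final String DEFAULT_START = "\u0002";
    static final String DEFAULT_END = "\u0003";

    private enum State { START, BODY, DONE }

    private final Iterator<String> wrapped;
    private final String startMarker;
    private final String endMarker;
    private State state = State.START;

    BookendIterator(Iterator<String> wrapped) {
        this(wrapped, DEFAULT_START, DEFAULT_END);
    }

    // Markers are overridable, mirroring the "overridable by user config" bullets.
    BookendIterator(Iterator<String> wrapped, String startMarker, String endMarker) {
        this.wrapped = wrapped;
        this.startMarker = startMarker;
        this.endMarker = endMarker;
    }

    @Override
    public boolean hasNext() {
        return state != State.DONE;
    }

    @Override
    public String next() {
        switch (state) {
            case START:
                // First call: emit the synthetic start token.
                state = State.BODY;
                return startMarker;
            case BODY:
                // Proxy to the wrapped stream until it is exhausted...
                if (wrapped.hasNext()) {
                    return wrapped.next();
                }
                // ...then emit the synthetic end token exactly once.
                state = State.DONE;
                return endMarker;
            default:
                throw new NoSuchElementException();
        }
    }

    // Helper for the demo: collect every token into a list.
    static List<String> drain(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> analyzed = Arrays.asList("quick", "brown", "dog").iterator();
        // Visible markers are used here only so the output is readable.
        System.out.println(drain(new BookendIterator(analyzed, "<start>", "<end>")));
    }
}
```

Note that in this model an empty wrapped stream still yields the start and end markers back to back; whether a real BookendFilter should behave that way is a design choice the description above leaves open.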




