Practical usages of arbitrary Shingles when using a query parser?

Chris Hostetter Mon, 30 Jul 2018 15:46:54 -0700

Although I've been aware of Shingles and some of their useful applications for a long time, today is the first time I really sat down and tried to do something non-trivial with them myself.

My objective seems relatively straightforward: given a corpus of text and some analyzer (for the sake of discussion let's assume simple whitespace tokenization w/lowercasing) I want to be able to say "I am happy to trade index time/size for faster queries of shorter phrases".

So instead of just indexing "the quick brown fox jumped over the lazy dog" as a field with 9 terms, I might want to add ShingleFilterFactory to the end of my analyzer using [[minShingleSize="2" maxShingleSize="2" outputUnigrams="true"]] and now I have a field w/17 terms (9 unigrams plus 8 bi-shingles), but if I get a query for a "phrase" of 2 words/terms, I should in theory be able to just use a TermQuery under the covers -- making it just as "fast" as a query for a single word/term. But meanwhile longer phrases should still "just work" as if I didn't have any shingles.
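
The term count above can be checked with a small stand-alone model. This is only a sketch in plain Python, not Lucene code: the function name `shingle` and its parameters are made up for illustration, and it assumes each shingle is reported at the position of its first word.

```python
def shingle(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Toy model of index-time shingling (hypothetical helper, not a Lucene API).

    Returns (position, term) pairs. A shingle is placed at the position
    of its first word, which is an assumption of this sketch.
    """
    out = []
    for pos in range(len(tokens)):
        if output_unigrams:
            out.append((pos, tokens[pos]))
        for size in range(min_size, max_size + 1):
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    return out


# Whitespace tokenization with lowercasing, as assumed in the discussion.
tokens = "the quick brown fox jumped over the lazy dog".lower().split()
indexed = shingle(tokens)

print(len(tokens), len(indexed))  # 9 17
```

The 17 terms are the 9 original words plus the 8 adjacent pairs, so a two-word phrase such as "quick brown" exists in the field as a single term.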


So far so good...

If I actually index a corpus as described above, and then at query time I use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2" outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the expected TermQuery for either a single word input or two-word input ... for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one composed of bi-shingles instead of individual unigrams, but AFAICT the position info is set correctly so that it will only match the documents that would have been matched w/o any shingles (and IIUC the term stats for the shingles seem like they should probably result in subjectively "better" scores? not certain on this bit, but also not overly concerned about it)
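
A rough model of that query-time behaviour, again as plain Python rather than Lucene code. The helper names are hypothetical, and the "TermQuery" / "PhraseQuery" labels simply restate the observation above (one analyzed term gives a term query, several consecutive terms give a phrase query); they are not produced by any real query parser here.

```python
def query_time_terms(tokens, min_size=2, max_size=2):
    """Model of outputUnigrams="false" + outputUnigramsIfNoShingles="true".

    Hypothetical helper: emits only shingles, and falls back to the plain
    unigrams when the input is too short to form any shingle.
    """
    shingles = [
        (pos, " ".join(tokens[pos:pos + size]))
        for pos in range(len(tokens))
        for size in range(min_size, max_size + 1)
        if pos + size <= len(tokens)
    ]
    return shingles if shingles else list(enumerate(tokens))


def query_shape(text):
    """Label the kind of query the analyzed terms would suggest."""
    terms = [term for _, term in query_time_terms(text.lower().split())]
    if not terms:
        return ("empty", [])
    kind = "TermQuery" if len(terms) == 1 else "PhraseQuery"
    return (kind, terms)


for phrase in ("fox", "quick brown", "quick brown fox"):
    print(phrase, "->", query_shape(phrase))
```

With bi-shingles only, every position carries exactly one term, which is why the longer inputs can still be expressed as an ordinary phrase.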

The problem is that (unless I'm missing something) this doesn't really work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.

If I change my index time ShingleFilterFactory to use [[minShingleSize="2" maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the query time analyzer would be [[minShingleSize="2" maxShingleSize="N" outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while that does seem to cause "phrase" input of all sizes to be converted by the analyzer+QueryParser into a query that (AFAICT) will match the correct documents (compared to using no shingles) it's only "optimized" as a TermQuery for one & two word phrases. For input phrases longer than 2 terms it generates a SpanOrQuery wrapping multiple SpanNearQueries, I believe because of the overlapping positions of the bi/tri/quad/etc. shingles.
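
The overlap can be seen by grouping the query-time shingles by position. This is a simplified Python model with a hypothetical helper name; it assumes a shingle sits at the position of its first word and ignores everything else a real token stream carries.

```python
def shingles_by_position(tokens, min_size, max_size):
    """Group shingles by the position of their first word (toy model)."""
    by_pos = {}
    for pos in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(tokens):
                by_pos.setdefault(pos, []).append(" ".join(tokens[pos:pos + size]))
    return by_pos


tokens = "quick brown fox".split()

# maxShingleSize=2: one term per position, a simple linear phrase.
print(shingles_by_position(tokens, 2, 2))

# maxShingleSize=3: position 0 now holds both a bi-shingle and a tri-shingle.
print(shingles_by_position(tokens, 2, 3))
```

In this model a three-word input with N=3 never collapses to the single tri-shingle; the bi-shingles are emitted alongside it, so the analyzer output is no longer one term or one linear sequence of terms.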

There just doesn't seem to be any good/generic way to leverage a field built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]] (where X != Y) at query time using a QueryParser configured with out of the box analyzer components.

It seems like what's missing is a ShingleFilter(Factory) configuration that means "output the maximum possible shingle size between MIN and MAX based on the size of the input stream" ... but that doesn't seem to exist.
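
One possible reading of that missing option, sketched in plain Python. Everything here is hypothetical: no such ShingleFilter setting is known to exist, the function name is invented, and the choice to emit consecutive MAX-sized shingles for inputs longer than MAX is an assumption about what the desired behaviour would be.

```python
def largest_shingles(tokens, min_size, max_size):
    """Hypothetical "largest shingle only" analysis (not an existing option).

    - fewer than min_size tokens: pass the unigrams through unchanged
    - min_size..max_size tokens:  one shingle covering the whole input
    - more than max_size tokens:  consecutive max_size shingles, one per position
    """
    n = len(tokens)
    if n < min_size:
        return list(enumerate(tokens))
    size = min(n, max_size)
    return [(pos, " ".join(tokens[pos:pos + size])) for pos in range(n - size + 1)]


for phrase in ("fox", "quick brown", "quick brown fox", "quick brown fox jumped"):
    print(phrase, "->", largest_shingles(phrase.split(), 2, 3))
```

Under these assumptions any input of up to MAX words becomes a single term, and longer inputs have exactly one term per position, so they could in principle be handled as a plain phrase over MAX-sized shingles.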

Does anyone have any advice/suggestions on how to approach this type of problem based on their own experiences? Does anyone have first-hand experience using maxShingleSize > 2 with a QueryParser (and w/o any preconceived assumptions about the length of the input)?



-Hoss
http://www.lucidworks.com/

