Although I've been aware of shingles and some of their useful applications for a long time, today is the first time I really sat down and tried to do something non-trivial with them myself.

My objective seems relatively straightforward: given a corpus of text and some analyzer (for the sake of discussion let's assume simple whitespace tokenization w/lowercasing) I want to be able to say "I am happy to trade index time/size for faster queries of shorter phrases".

So instead of just indexing "the quick brown fox jumped over the lazy dog" as a field with 9 terms, I might want to add ShingleFilterFactory to the end of my analyzer using [[minShingleSize="2" maxShingleSize="2" outputUnigrams="true"]] and now I have a field w/17 terms (9 unigrams plus 8 bi-shingles). If I get a query for a "phrase" of 2 words/terms, I should in theory be able to just use a TermQuery under the covers -- making it just as "fast" as a query for a single word/term. Meanwhile, longer phrases should still "just work" as if I didn't have any shingles.
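To make the term arithmetic concrete, here is a rough Python sketch of what that index-time configuration emits. It is not Lucene code: the `shingle` function and its parameter names are made up for illustration, and it only models which terms land at which position, assuming each shingle sits at the position of its first token.

```python
def shingle(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Return (position, term) pairs, roughly as a shingle filter would emit them."""
    out = []
    for pos in range(len(tokens)):
        if output_unigrams:
            out.append((pos, tokens[pos]))
        for size in range(min_size, max_size + 1):
            # only build a shingle if enough tokens remain
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    return out

# whitespace tokenization with lowercasing, as assumed in the text
tokens = "the quick brown fox jumped over the lazy dog".lower().split()
terms = shingle(tokens)
print(len(tokens), len(terms))
```

The 17 here counts term occurrences (9 unigrams plus 8 bi-shingles), not unique terms, since "the" appears twice.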

So far so good...

If I actually index a corpus as described above, and then at query time I use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2" outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the expected TermQuery for either a single word input or two-word input. For input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one composed of bi-shingles instead of individual unigrams, but AFAICT the position info is set correctly so that it will only match the documents that would have been matched w/o any shingles. (IIUC the term stats for the shingles seem like they should probably result in subjectively "better" scores? I'm not certain on this bit, but also not overly concerned about it.)

The problem is that (unless I'm missing something) this doesn't really work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.

If I change my index time ShingleFilterFactory to use [[minShingleSize="2" maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the query time analyzer would be [[minShingleSize="2" maxShingleSize="N" outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while that does seem to cause "phrase" input of all sizes to be converted by the analyzer+QueryParser into a query that (AFAICT) will match the correct documents (compared to using no shingles), it's only "optimized" as a TermQuery for one & two word phrases. For input phrases longer than 2 terms it generates a SpanOrQuery wrapping multiple SpanNearQueries, I believe because of the overlapping positions of the bi/tri/quad/etc. shingles.
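The overlap is easy to see in a small Python model of the query-time settings (outputUnigrams="false", outputUnigramsIfNoShingles="true"). This is only a sketch: `query_shingles` and `stacked_positions` are hypothetical helper names, and the model tracks term positions only, ignoring position lengths and everything else the real filter and QueryParser do.

```python
from collections import Counter

def query_shingles(tokens, min_size, max_size):
    """Return (position, term) pairs for shingles only, falling back to
    unigrams when the input is too short to build any shingle."""
    out = []
    for pos in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(tokens):
                out.append((pos, " ".join(tokens[pos:pos + size])))
    if not out:
        # mimics outputUnigramsIfNoShingles="true"
        out = list(enumerate(tokens))
    return out

def stacked_positions(terms):
    """Positions that carry more than one term."""
    counts = Counter(pos for pos, _ in terms)
    return sorted(pos for pos, n in counts.items() if n > 1)

phrase = "quick brown fox".split()
for max_size in (2, 3):
    terms = query_shingles(phrase, 2, max_size)
    print(max_size, terms, stacked_positions(terms))
```

With a max size of 2, every position holds exactly one shingle, so the three-word input reads as a simple sequence of terms. With a max size of 3, position 0 holds both "quick brown" and "quick brown fox", so there is no longer a single linear sequence to turn into a plain PhraseQuery.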

There just doesn't seem to be any good/generic way to leverage a field built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]] (where X != Y) at query time using a QueryParser configured with out-of-the-box analyzer components.

It seems like what's missing is a ShingleFilter(Factory) configuration that means "output the maximum possible shingle size between MIN and MAX based on the size of the input stream" ... but that doesn't seem to exist.
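For illustration, the behavior described above could be sketched in Python as follows. To be clear, this is a hypothetical filter, not an existing ShingleFilter option: `max_only_shingles` is an invented name, and the sketch assumes the whole input can be buffered so its length is known before anything is emitted.

```python
def max_only_shingles(tokens, min_size, max_size):
    """Emit only shingles of the largest size that fits the input,
    as (position, term) pairs; fall back to unigrams if the input
    is shorter than min_size."""
    n = len(tokens)
    size = min(n, max_size)
    if size < min_size:
        # too short to shingle at all
        return list(enumerate(tokens))
    return [(pos, " ".join(tokens[pos:pos + size]))
            for pos in range(n - size + 1)]

for text in ("fox", "quick brown fox", "the quick brown fox jumped over"):
    print(max_only_shingles(text.split(), 2, 4))
```

Under this scheme any input of up to MAX words collapses to a single term, and longer input becomes one MAX-sized shingle per position, which is the kind of linear sequence a plain PhraseQuery can represent.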

Does anyone have any advice/suggestions on how to approach this type of problem based on their own experiences? Does anyone have first-hand experience using maxShingleSize > 2 with a QueryParser (and w/o any preconceived assumptions about the length of the input)?

-Hoss
http://www.lucidworks.com/
