Although I've been aware of Shingles and some of their useful applications
for a long time, today is the first time I really sat down and tried to do
something non-trivial with them myself.
My objective seems relatively straightforward: given a corpus of text and
some analyzer (for the sake of discussion let's assume simple whitespace
tokenization w/lowercasing) I want to be able to say "I am happy to trade
index time/size for faster queries of shorter phrases".
So instead of just indexing "the quick brown fox jumped over the lazy dog"
as a field with 9 terms, I might want to add ShingleFilterFactory to the
end of my analyzer using [[minShingleSize="2" maxShingleSize="2"
outputUnigrams="true"]] and now I have a field w/17 terms (9 unigrams plus
8 bi-shingles), but if I get a query for a "phrase" of 2 words/terms, I
should in theory be able to just use a TermQuery under the covers -- making
it just as "fast" as a query for a single word/term. But meanwhile longer
phrases should still "just work" as if I didn't have any shingles.
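To make the term count concrete, here is a minimal plain-Java sketch that imitates the index-time settings above (whitespace tokenization, lowercasing, bi-shingles plus unigrams). It deliberately does not call Lucene: the class IndexShingleSketch and its shingles method are hypothetical names used only for illustration, and the output ordering is an assumption rather than a claim about how ShingleFilter orders its tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java imitation of the index-time shingle settings; it does not use Lucene. */
class IndexShingleSketch {

    /** Returns the unigrams (if requested) plus every shingle of minSize..maxSize words. */
    static List<String> shingles(String text, int minSize, int maxSize, boolean outputUnigrams) {
        // Whitespace tokenization + lowercasing, as assumed in the discussion.
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> terms = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            if (outputUnigrams) {
                terms.add(words[start]);
            }
            // Emit every shingle that starts at this word and still fits in the input.
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                terms.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms =
            shingles("the quick brown fox jumped over the lazy dog", 2, 2, true);
        System.out.println(terms.size() + " terms: " + terms);
    }
}
```

For the example sentence this yields 17 term occurrences: the 9 original words plus the 8 two-word shingles that can start at each word except the last.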
So far so good...
If I actually index a corpus as described above, and then at query time I
use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2"
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the
expected TermQuery for either a single word input or two-word input ...
for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one
composed of bi-shingles instead of individual unigrams, but AFAICT the
position info is set correctly so that it will only match the documents
that would have been matched w/o any shingles (and IIUC the term stats
for the shingles seem like they should probably result in subjectively
"better" scores? not certain on this bit, but also not overly concerned
about it).
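The query-time behaviour described above can be sketched the same way. This is a plain-Java imitation under the assumption of whitespace tokenization and lowercasing; QueryBiShingleSketch, queryTerms and queryShape are hypothetical names, and the "TermQuery"/"PhraseQuery" strings are only labels for the query shapes reported above, not objects produced by a real QueryParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java imitation of the query-time bi-shingle settings; it does not use Lucene. */
class QueryBiShingleSketch {

    /** Bi-shingles only, falling back to the lone unigram when no shingle can be formed. */
    static List<String> queryTerms(String input) {
        String[] words = input.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> terms = new ArrayList<>();
        // outputUnigrams="false": emit only 2-word shingles, one per start position.
        for (int start = 0; start + 2 <= words.length; start++) {
            terms.add(words[start] + " " + words[start + 1]);
        }
        // outputUnigramsIfNoShingles="true": a one-word input still yields its unigram.
        if (terms.isEmpty()) {
            terms.addAll(Arrays.asList(words));
        }
        return terms;
    }

    /** Labels the query shape described in the post; a label only, not a Lucene Query. */
    static String queryShape(String input) {
        return queryTerms(input).size() == 1 ? "TermQuery" : "PhraseQuery";
    }

    public static void main(String[] args) {
        for (String input : new String[] {"fox", "quick brown", "quick brown fox"}) {
            System.out.println(queryShape(input) + " " + queryTerms(input));
        }
    }
}
```

One- and two-word inputs collapse to a single term, while a three-word input becomes the two consecutive shingles "quick brown" and "brown fox", which is why a simple phrase over bi-shingles is enough in the maxShingleSize="2" case.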
The problem is that (unless I'm missing something) this doesn't really
work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.
If I change my index time ShingleFilterFactory to use [[minShingleSize="2"
maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the
query time analyzer would be [[minShingleSize="2" maxShingleSize="N"
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while
that does seem to cause "phrase" input of all sizes to be converted by the
analyzer+QueryParser into a query that (AFAICT) will match the correct
documents (compared to using no shingles) it's only "optimized" as a
TermQuery for one & two word phrases. For input phrases longer than 2
terms it generates a SpanOrQuery wrapping multiple SpanNearQueries,
I believe because of the overlapping positions of the bi/tri/quad/etc.
shingles.
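A rough way to see the overlap is to group the query-time shingles by the position at which they start. The sketch below is plain Java rather than Lucene, the names QueryMultiShingleSketch, shinglesByPosition and hasStackedPositions are hypothetical, and it assumes that shingles of different sizes starting at the same word share a position, which is the behaviour suspected above rather than something this code verifies against ShingleFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java imitation of query-time shingles of several sizes; it does not use Lucene. */
class QueryMultiShingleSketch {

    /** Maps each start position to the shingles (minSize..maxSize words) that begin there. */
    static Map<Integer, List<String>> shinglesByPosition(String input, int minSize, int maxSize) {
        String[] words = input.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<Integer, List<String>> byPosition = new LinkedHashMap<>();
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                String shingle = String.join(" ", Arrays.copyOfRange(words, start, start + size));
                byPosition.computeIfAbsent(start, k -> new ArrayList<>()).add(shingle);
            }
        }
        // outputUnigramsIfNoShingles="true": fall back to unigrams when nothing was emitted.
        if (byPosition.isEmpty()) {
            for (int i = 0; i < words.length; i++) {
                byPosition.put(i, new ArrayList<>(List.of(words[i])));
            }
        }
        return byPosition;
    }

    /** True when some position carries more than one shingle, so there is no single linear phrase. */
    static boolean hasStackedPositions(Map<Integer, List<String>> byPosition) {
        return byPosition.values().stream().anyMatch(list -> list.size() > 1);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"quick brown", "quick brown fox"}) {
            Map<Integer, List<String>> byPosition = shinglesByPosition(input, 2, 3);
            System.out.println(byPosition + " stacked=" + hasStackedPositions(byPosition));
        }
    }
}
```

With sizes 2..3, the two-word input still produces one shingle, but the three-word input puts both "quick brown" and "quick brown fox" at position 0, so the token stream is no longer a simple sequence that maps onto a TermQuery or a plain PhraseQuery.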
There just doesn't seem to be any good/generic way to leverage a field
built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]]
(where X != Y) at query time using a QueryParser configured with out of
the box analyzer components.
It seems like what's missing is a ShingleFilter(Factory) configuration
that means "output the maximum possible shingle size between MIN and
MAX based on the size of the input stream" ... but that doesn't seem to
exist.
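For discussion purposes, the missing behaviour might look something like the following plain-Java sketch. It is purely hypothetical: LongestShingleSketch, pickSize and queryTerms are made-up names, no such option is being claimed to exist in ShingleFilter(Factory), and the fallback to unigrams for inputs shorter than MIN is an assumption modelled on outputUnigramsIfNoShingles="true".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical "largest shingle size that fits" query-time behaviour; not a Lucene API. */
class LongestShingleSketch {

    /** Picks the single shingle size to emit for an input of wordCount words. */
    static int pickSize(int wordCount, int minSize, int maxSize) {
        if (wordCount < minSize) {
            return 1; // assumed fallback: plain unigrams when no shingle fits
        }
        return Math.min(wordCount, maxSize);
    }

    /** Emits shingles of exactly one size, so each position holds a single term. */
    static List<String> queryTerms(String input, int minSize, int maxSize) {
        String[] words = input.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int size = pickSize(words.length, minSize, maxSize);
        List<String> terms = new ArrayList<>();
        for (int start = 0; start + size <= words.length; start++) {
            terms.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] inputs = {"fox", "quick brown fox", "the quick brown fox jumped"};
        for (String input : inputs) {
            System.out.println(queryTerms(input, 2, 4));
        }
    }
}
```

Under these assumptions any input of up to MAX words becomes a single term, and longer input becomes a linear chain of MAX-sized shingles at consecutive positions, which is the shape a TermQuery or a plain PhraseQuery could handle.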
Does anyone have any advice/suggestions on how to approach this type of
problem based on their own experiences? Does anyone have first hand
experience using maxShingleSize > 2 with a QueryParser (and w/o any
preconceived assumptions about the length of the input)?
-Hoss
http://www.lucidworks.com/