: The query parser is confused by these overlapping positions indeed, which
: it interprets as synonyms. I was going to write that you should set the

Sure -- i'm not blaming the QueryParser, what it does with the Shingles
output makes sense (and actually works! ... just not as efficiently as
possible). I'm trying to figure out how to make the ShingleFilter output
more useful in the query time analyzer use case.

: it interprets as synonyms. I was going to write that you should set the
: same min and max shingle sizes at query time, but while writing that I
: realized that you probably wanted to keep outputing shorter shingles so
: that a phrase query on 2 terms with a max shingle size of 3 would still use

Yes exactly ... if at index time you output both unigrams and shingles of
sizes 2-5, and at query time you have a "phrase" of only 2 words, ideally
the filter should output a simple Token so you can make a single TermQuery
-- likewise if you have a phrase of 3 words, or 4 words, or 5 words, those
should ideally all produce single tokens.

Your suggestion of "same min & max at query time" where min=max=X is
something i briefly considered, but that means you're only optimizing the
"phrases" of length X: all shorter phrases just use unigrams, and in fact
there is no point in building shingles of any size other than X at index
time.

: shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
: like 'outputShinglesOfTheMaximumSizeOnly'?

That's what i was thinking -- but i haven't dug into the code enough to
understand how complex that would be. (i was starting with "Am i missing
something about how/why this shouldn't/doesn't already exist?")

: For the record, in addition to the problems that you mentioned,
: ShingleFilter proved very hard to be fixed in order to work correctly on
: top of synonyms when X != Y[1], which encouraged Alan work on a new
: FixedShingleFilter[2] that deals with index-time synonyms (ie. ignores

Yeah ...
i can't even imagine the complexity of dealing with "graph" based synonyms
and shingles (didn't read your link for fear of my own sanity)

: position length) just fine but only allows X == Y. Also instead of feeding
: an analyzer with shingles to the query parser, we found it more
: user-friendly to add an option to text fields in order to index 2-shingles
: into a separate field and redirect phrase queries to it.[3] We did

Right ... i'm actually looking at a system now that puts uni-shingles,
bi-shingles, and tri-shingles in 3 diff fields, and then pre-parses the
input to figure out how long it is to decide which field to query ... i'm
trying to simplify that.

Ideally what I'd like to be able to say is: "give me a phrase; if the field
is configured w/o any shingles at all it will work fine (via PhraseQuery),
but if the analyzer is configured with shingles it will be even faster (via
term query) if/when the query phrase is "shorter" than the max shingle
length."

-Hoss
http://www.lucidworks.com/
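
PS: to make the query-time behavior discussed above concrete, here is a
rough sketch in plain Java. It is only an illustration under assumptions:
ShingleSketch and queryTimeTokens are made-up names, not Lucene API, and the
handling of phrases longer than the max shingle size (emit only max-size
shingles) is just one possible reading of the
'outputShinglesOfTheMaximumSizeOnly' idea. Words are joined with a single
space, which matches ShingleFilter's default token separator.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
final class ShingleSketch {

    // Hypothetical helper: the tokens a query-time filter would ideally emit
    // for a phrase, given the largest shingle size that was indexed.
    static List<String> queryTimeTokens(List<String> words, int maxShingleSize) {
        if (maxShingleSize < 1) {
            throw new IllegalArgumentException("maxShingleSize must be >= 1");
        }
        List<String> out = new ArrayList<>();
        if (words.isEmpty()) {
            return out;
        }
        // If the phrase fits inside one indexed shingle, size == words.size()
        // and the loop emits exactly one token (usable as a single TermQuery).
        // Otherwise only max-size shingles are emitted, one per start position.
        int size = Math.min(words.size(), maxShingleSize);
        for (int start = 0; start + size <= words.size(); start++) {
            out.add(String.join(" ", words.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Phrase of 2 words, shingles indexed up to size 5: one token.
        System.out.println(queryTimeTokens(List.of("new", "york"), 5));
        // Phrase of 4 words, shingles indexed up to size 3: two 3-shingles.
        System.out.println(queryTimeTokens(List.of("a", "b", "c", "d"), 3));
    }
}
```

With a max shingle size of 5, any phrase of 1 to 5 words comes back as
exactly one token, which is the case that would let the query parser build a
single TermQuery instead of a PhraseQuery.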