CommonGrams provides a neat trick for optimizing slow phrase queries that contain common words. (E.g. Hathi Trust has some data <http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2> showing how effective this can be.) Unfortunately, it does nothing for other position-based queries (e.g. SpanQueries or sloppy phrase queries); if those queries contain an extremely common word, they can still run quite slowly.
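(For anyone who hasn't used it, here's a minimal sketch of the mechanism. It assumes the CommonGramsFilter from Lucene's analyzers-common module; exact package names and constructors vary a bit between versions.)

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.StringReader;
import java.util.Arrays;

public class CommonGramsDemo {
  public static void main(String[] args) throws Exception {
    // terms we consider "common"
    CharArraySet common = new CharArraySet(Arrays.asList("the", "and", "a"), true);
    Tokenizer source = new WhitespaceTokenizer();
    source.setReader(new StringReader("the quick fox and the lazy dog"));
    TokenStream ts = new CommonGramsFilter(source, common);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // prints the original unigrams plus overlapping bigram tokens such as
      // "the_quick", "fox_and", "and_the", "the_lazy" (exact output depends on version)
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}

A phrase query containing "the" can then be rewritten (via the matching query-side filter) to match the single "the_quick" term instead of walking the enormous postings list for "the", which is where the speedup comes from.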
The general lesson I take from CommonGrams is that by precomputing structures that capture the proximity relationship between 2 or more terms, you can speed things up. Bigrams are probably the simplest such proximity-capturing structure, but it seems like others could exist. What I'm curious about is whether there are more advanced structures that could speed up proximity search more generally, ideally ones that don't require rewriting huge chunks of Lucene. For example, the following scheme might not be a good idea, but perhaps it hints at the sorts of things one could try:

***

Suppose I wanted to optimize cases where a common word appears within 10 words of some other word. Maybe I could get partway there by injecting what you might call "proximitygrams" into the token stream. For example, rather than simply tokenizing like so:

  fifteen big lazy dogs, a fish, and a cat went shopping
    --> fifteen big lazy dogs a fish and a cat went shopping

I might tokenize like so:

  fifteen big lazy dogs, a fish, and a cat went shopping
    --> fifteen big lazy dogs a fish {and, within10_and_fifteen, within10_and_big, within10_and_lazy, within10_and_dogs, within10_and_a, within10_and_fish, within10_and_cat, within10_and_went, within10_and_shopping} a cat went shopping

where the curly braces are meant to show overlapping tokens (i.e. tokens with positionIncrement == 0). Each "within10_x_y" token records that terms x and y appear within 10 words of one another.

Then, at query time, if the user wanted to find cases where "lazy" appears within 8 words of "and", I could get approximately what I want by searching for the token "within10_and_lazy". I'd still have to do some post-processing to throw out results where "and" and "lazy" were actually 9 or 10 words apart rather than within 8, but I'd be part of the way there.

Another possible extension of this scheme would be to use payloads, so that each "within10" token knows the exact positions of the two terms it refers to. (It could also record exactly how far apart they are.) It seems like you could use that to optimize SpanNear queries between common terms, or sloppy phrase queries.

I don't know whether anything like the above would actually pay for itself, given the increased index size and code complexity. But I wanted to check whether this, or anything remotely along these lines, sounds promising to anyone. To make the idea a bit more concrete, I've included a rough sketch of the kind of TokenFilter I'm imagining below.
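To be clear, this is only an illustration of the idea, not something I've actually tried: the ProximityGramFilter name, the common-word set, and the window size are all invented for the example, it buffers the entire token stream up front, and it assumes every incoming token has positionIncrement == 1 (no synonyms or stopword holes).

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Sketch: whenever a common word appears, inject overlapping tokens of the
 * form "within<window>_<common>_<other>" for every other term within
 * `window` positions of it. Buffers the whole stream for simplicity and
 * ignores position-increment subtleties, so it's illustrative only.
 */
public final class ProximityGramFilter extends TokenFilter {

  private static final class OutToken {
    final String term;
    final int posInc;
    OutToken(String term, int posInc) { this.term = term; this.posInc = posInc; }
  }

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final CharArraySet commonWords;
  private final int window;
  private final Deque<OutToken> pending = new ArrayDeque<>();
  private boolean buffered = false;

  public ProximityGramFilter(TokenStream input, CharArraySet commonWords, int window) {
    super(input);
    this.commonWords = commonWords;
    this.window = window;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!buffered) {
      bufferAll();
      buffered = true;
    }
    if (pending.isEmpty()) {
      return false;
    }
    OutToken t = pending.poll();
    clearAttributes();
    termAtt.setEmpty().append(t.term);
    posIncAtt.setPositionIncrement(t.posInc);
    return true;
  }

  private void bufferAll() throws IOException {
    List<String> terms = new ArrayList<>();
    while (input.incrementToken()) {
      terms.add(termAtt.toString());  // assumes one token per position
    }
    for (int i = 0; i < terms.size(); i++) {
      // the original token advances the position by 1
      pending.add(new OutToken(terms.get(i), 1));
      if (commonWords.contains(terms.get(i))) {
        // inject overlapping proximitygrams (posInc == 0) for every term
        // within `window` positions of this common word
        int lo = Math.max(0, i - window);
        int hi = Math.min(terms.size() - 1, i + window);
        for (int j = lo; j <= hi; j++) {
          if (j == i) continue;
          pending.add(new OutToken("within" + window + "_" + terms.get(i) + "_" + terms.get(j), 0));
        }
      }
    }
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
    buffered = false;
  }
}

Query-side, a SpanNear or sloppy phrase query between a common term and some other term would then get rewritten into a plain TermQuery on the corresponding "within10" token (plus whatever position post-processing is needed), which is where the savings would have to come from.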