Hey! I'm actually looking all on my own. Anyway, 2<b> gives me TooManyClauses. It looks like what I want is to use IndexReader.termPositions to aggregate the offsets of all the wildcard terms on a per-document basis, then walk the lists for proximity. Something like...
for each wildcard term wt for each WildCardTermEnum wet for each termdoc aggregate the positions by doc ID with other terms from wt. Now I have, for each original wildcard term a list of all doc IDs and all positions that any term matching the wildcard occupies. For any doc that appears in all lists, compare the positions for proximity and, if proximity is met, add it to my filter. And away we go. Of course, I have no idea what the speed here is, but I guess that's what testing is for. Am I on the right path? Erick On 8/2/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
I'm back, with another flavor of wildcards. What direction would you point a poor boy who's project lead wants wildcard queries and spans? Here's the problem.... I cannot use any of the classes that throw a "TooManyClauses" exception ( e.g. SpanRegexQuery or SpanNearQuery with, say WildCardQuery). The corpus is big enough that this is guaranteed to be thrown. So, currently I'm using a filter for wildcard queries, populating it via WildcardTermEnum and TermDocs... Works like a champ. But I don't see how to combine this with spans... It seems to me that spans are incompatible with filters, they're just different beasts. I see no way incorporate spans and filters without doing actual work myself. So, it seems I'm left with several alternatives. 1> figure it out when creating the filter. Conceptually, for each document find the offsets of the terms I want to span, and find out if the distance between them fits my criteria and only add the doc to the filter if the distance is within my parameters. 2> Look at the docs returned by the current filtered process and, for each doc returned, a> don't add if it doesn't fit my span criteria by examining the term positions. b> re-query with a wildcard span, restricted by doc ID. I *think* that by restricting the query by (lucene) doc_id I'll be able to avoid the "too many clauses" issue. Assuming that I remember correctly and that the most-restrictive clause is honored when trying this.... guys, feel free to hop in here with just the names of the classes I really want to pay attention to <G>.... I know this is scanty info, what I'm looking for is a very quick pointer.... What I'm especially looking for is "Just use the contrib/JustWhatYouWanted class" <G> although I poked around and didn't see anything... Thanks Erick