On Wednesday 11 October 2006 20:30, Erik Hatcher wrote:
> Erick - what about using getSpans() from the SpanQuery that is
> generated? That should give you what you're after, I think.
>
>     Erik

You can also use skipTo(docNr) on the spans to skip to the docNr of the
book that you're after. A Filter for the single book would also work,
but using skipTo() yourself on the spans is easier.
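Roughly like this - a minimal, untested sketch; the method wrapper and
the variable names are made up for illustration, but getSpans(),
skipTo(), doc() and next() are the actual Spans API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.Spans;

    /** Counts the span matches inside a single document (e.g. one page). */
    static int countSpansInDoc(SpanQuery query, IndexReader reader, int docNr)
            throws IOException {
        Spans spans = query.getSpans(reader);
        int count = 0;
        if (spans.skipTo(docNr)) {          // lands on first match with doc() >= docNr
            while (spans.doc() == docNr) {  // only count matches on that page
                count++;
                if (!spans.next()) {        // advance to the next matching span
                    break;
                }
            }
        }
        return count;
    }

Each matching span counts once, which is one reasonable reading of the
ambiguous "how many times does this occur" below.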
Regards,
Paul Elschot

> On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:
>
> > Problem 3482:
> >
> > I'm probably close to being able to start work. Except...
> >
> > How do I count hits with SrndQuery? Or, more generally, with
> > arbitrary wildcards and boolean operators?
> >
> > So, say I've indexed a book by page. That is, each page is a
> > document. I know a particular page matches my query because the
> > SrndQuery found it. Now I want to answer the question "How many
> > times did the query match on this page?"
> >
> > For a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?,
> > k?i?f, h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine
> > adding a not or three, a nested pair of OR clauses, and..... No,
> > DON'T tell me that that probably wouldn't match any pages anyway
> > <G>....
> >
> > Anyway, I want to answer "how many times does this occur on the
> > page?". Another way of asking this, I suppose, is "how many terms
> > would be highlighted on that page?", but I don't think highlighting
> > helps. And I'm aware that the question "how many times does 'this'
> > occur?" is ambiguous, especially when we add the not case in......
> >
> > I can think of a couple of approaches:
> > 1> Get down and dirty with the terms. That is, examine the term
> > position vectors and compare all the nasty details of where they
> > occur, combined with, say, RegexTermEnum, and go at it. This is
> > fairly ugly, especially with nested queries. But I can do it,
> > especially if we limit the complexity of the query or define the
> > hit count more simply.
> > 2> Get clever with a regex: fetch the text of the page and see how
> > many times the regex matches. I'd imagine that the regex will
> > be...er...unpleasant.
> > 2a> Use simpler regex expressions for each term, assemble the list
> > of match positions, and count.
> > 2b> Isn't this really just using TermDocs as it was meant to be
> > used, combined with RegexTermEnum?
> > 2c> Since the number of regex matches on a particular page is much
> > smaller than the number of regex matches over the entire index,
> > does anyone have a feel for whether <2a> or <2b> is easier/faster?
> > For <2a>, I'm analyzing a page with a regex. For <2b>, Lucene has
> > already done the pattern matching, but I'm reading a bunch of
> > different TermDocs......
> >
> > Fortunately, for this application, I only care about the hits per
> > page for a single book at a time. I do NOT have to create a list of
> > all hits on all pages for all books that have any match.
> >
> > Thanks
> > Erick
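For approach <2> above, plain java.util.regex does the counting once
you have the stored page text - a minimal sketch; the field name
"text" is a placeholder for wherever the page body is stored:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    /** Counts how often a (possibly very ugly) regex matches on one page. */
    static int countRegexMatches(IndexReader reader, int docNr, Pattern pattern)
            throws IOException {
        Document page = reader.document(docNr);  // fetch the stored page
        String text = page.get("text");          // placeholder field name
        if (text == null) {
            return 0;                            // page body was not stored
        }
        int count = 0;
        Matcher m = pattern.matcher(text);
        while (m.find()) {                       // each find() is one match
            count++;
        }
        return count;
    }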
> > On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> > >
> > > Doron:
> > >
> > > Thanks for the suggestion, I'll certainly put it on my list,
> > > depending upon what the PM decides. This app is genealogy
> > > research, and users *can* put in their own wildcards...
> > >
> > > This is why I love this list... lots of smart people giving me
> > > suggestions I never would have thought of <G>...
> > >
> > > Thanks
> > > Erick
> > >
> > > On 10/9/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
> > > >
> > > > "Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006
> > > > 13:09:21:
> > > > > ... The kicker is that what we are indexing is OCR data, some
> > > > > of which is pretty trashy. So you wind up with "interesting"
> > > > > words in your index, things like rtyHrS. So the whole question
> > > > > of allowing very specific queries on detailed wildcards
> > > > > (combined with spans) is under discussion. It's not at all
> > > > > clear to me that there's any value to the end users in the
> > > > > capability of, say, two-character prefixes. And it's an easy
> > > > > rule that "prefix queries must specify at least 3 non-wildcard
> > > > > characters"....
> > > >
> > > > Erick, I may be off track here, but, fwiw, have you considered
> > > > n-gram indexing/search for a degree of fuzziness to compensate
> > > > for OCR errors?
> > > >
> > > > For a four-word query you would probably get ~20 tokens
> > > > (bigrams?) - no matter what the index size is. You would then
> > > > probably want to score higher by LA (lexical affinity - query
> > > > terms appearing close to each other in the document) - and I am
> > > > not sure to what degree a span query (made of n-gram terms)
> > > > would serve that, because (1) all terms in the span need to be
> > > > there (well, I think :-); and (2) you would like to increase the
> > > > doc score for close-by terms only for close-by query n-grams.
> > > >
> > > > So there might not be a ready-to-use solution in Lucene for
> > > > this, but perhaps this is a more robust direction to try than
> > > > the wildcard approach - I mean, if users want to type a wildcard
> > > > query, it is their right to do so, but for the application logic
> > > > this does not seem the best choice.
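On the n-gram idea above: indexing replaces each word with its
overlapping character n-grams, so an OCR error only corrupts the few
grams that touch it. A minimal illustration in plain Java - the helper
is made up for illustration and not tied to any particular analyzer:

    import java.util.ArrayList;
    import java.util.List;

    /** Splits a word into overlapping character n-grams: "kind" -> [ki, in, nd]. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // The OCR garble "k1nd" still shares the bigram "nd" with "kind",
    // so an n-gram query degrades instead of missing entirely:
    //   ngrams("kind", 2) -> [ki, in, nd]
    //   ngrams("k1nd", 2) -> [k1, 1n, nd]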