On Wednesday 11 October 2006 20:30, Erik Hatcher wrote:
> Erick - what about using getSpans() from the SpanQuery that is
> generated?  That should give you what you're after, I think.
> 
>       Erik

You can also use skipTo(docNr) on the spans to skip to the docNr
of the book that you're after; a rough sketch is below. A Filter for
the single book would also work (a second sketch follows the first),
but using skipTo() yourself on the spans is easier.
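
Something like this should do it, against the Spans API as of Lucene
2.0 (a minimal, untested sketch; the method name countMatches and the
assumption that the generated query is a SpanQuery are mine):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.Spans;

    // Count how many times a span query matches within one document
    // (here, one page). Assumes query is the SpanQuery generated from
    // the SrndQuery and docNr is the page's document number.
    static int countMatches(SpanQuery query, IndexReader reader, int docNr)
        throws java.io.IOException {
      Spans spans = query.getSpans(reader);
      int count = 0;
      if (spans.skipTo(docNr)) {         // first match at or after docNr
        while (spans.doc() == docNr) {   // stay on the target document
          count++;                       // one (start, end) match per step
          if (!spans.next()) {
            break;
          }
        }
      }
      return count;
    }

Each pass of the inner loop sees one (start, end) occurrence, so count
answers "how many times did the query match on this page", at least for
one reasonable definition of a hit.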

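For the Filter route, a sketch under the same caveats (SingleDocFilter
is a hypothetical name, not a Lucene class; and since each page is a
document, a filter for a whole book would really set the bit of every
page belonging to it, not just one):

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // Restricts a search to a single document by switching on only
    // that document's bit.
    class SingleDocFilter extends Filter {
      private final int docNr;
      SingleDocFilter(int docNr) { this.docNr = docNr; }
      public BitSet bits(IndexReader reader) throws java.io.IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(docNr);  // only this document passes the filter
        return bits;
      }
    }

You would pass it as the filter argument of IndexSearcher.search(),
but as said, skipTo() saves you the extra plumbing.
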
Regards,
Paul Elschot


> 
> 
> On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:
> 
> > Problem 3482:
> >
> > I'm probably close to being able to start work. Except...
> >
> > How to count hits with SrndQuery? Or, more generally, with arbitrary
> > wildcards and boolean operators?
> >
> > So, say I've indexed a book by page. That is, each page is a document. I
> > know a particular page matches my query because the SrndQuery found it.
> > Now, I want to answer the question "How many times did the query match on
> > this page"?
> >
> > For a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?, k?i?f,
> > h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine adding a not or
> > three, a nested pair of OR clauses and..... No, DON'T tell me that that
> > probably wouldn't match any pages anyway <G>....
> >
> > Anyway, I want to answer "how many times does this occur on the page".
> > Another way of asking this, I suppose, is "how many terms would be
> > highlighted on that page", but I don't think highlighting helps. And I'm
> > aware that the question "how many times does 'this' occur" is ambiguous,
> > especially when we add the not case in......
> >
> > I can think of a couple of approaches:
> > 1> get down and dirty with the terms. That is, examine the term position
> > vectors and compare all the nasty details of where they occur, combined
> > with, say, regextermenum and go at it. This is fairly ugly, especially
> > with nested queries. But I can do it, especially if we limit the
> > complexity of the query or define the hit count more simply.
> > 2> get clever with a regex, fetch the text of the page and see how many
> > times the regex matches. I'd imagine that the regex will
> > be...er...unpleasant.
> > 2a> Use simpler regex expressions for each term, assemble the list of
> > match positions, and count.
> > 2b> Isn't this really just using TermDocs as it was meant to be used,
> > combined with regextermenum?
> > 2c> Since the number of regex matches on a particular page is much smaller
> > than the number of regex matches over the entire index, does anyone have a
> > feel for whether <2a> or <2b> is easier/faster? For <2a>, I'm analyzing a
> > page with a regex. For <2b>, Lucene has already done the pattern matching,
> > but I'm reading a bunch of different termdocs......
> >
> > Fortunately, for this application, I only care about the hits per page
> > for a single book at a time. I do NOT have to create a list of all hits
> > on all pages for all books that have any match.
> >
> > Thanks
> > Erick
> >
> > On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >>
> >> Doron:
> >>
> >> Thanks for the suggestion, I'll certainly put it on my list, depending
> >> upon what the PM decides. This app is genealogy research, and users
> >> *can* put in their own wildcards...
> >>
> >> This is why I love this list... lots of smart people giving me
> >> suggestions I never would have thought of <G>...
> >>
> >> Thanks
> >> Erick
> >>
> >> On 10/9/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
> >> >
> >> > "Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> >> > > ... The kicker is that what we are indexing is OCR data, some of
> >> > > which is pretty trashy. So you wind up with "interesting" words in
> >> > > your index, things like rtyHrS. So the whole question of allowing
> >> > > very specific queries on detailed wildcards (combined with spans) is
> >> > > under discussion. It's not at all clear to me that there's any value
> >> > > to the end users in the capability of, say, two-character prefixes.
> >> > > And, it's an easy rule that "prefix queries must specify at least 3
> >> > > non-wildcard characters"....
> >> >
> >> > Erick, I may be off course here, but, fwiw, have you considered n-gram
> >> > indexing/search for a degree of fuzziness to compensate for OCR
> >> > errors..?
> >> >
> >> > For a four-word query you would probably get ~20 tokens (bigrams?) -
> >> > no matter what the index size is. You would then probably want to
> >> > score higher by LA (lexical affinity - query terms appear close to
> >> > each other in the document) - and I am not sure to what degree a span
> >> > query (made of n-gram terms) would serve that, because (1) all terms
> >> > in the span need to be there (well, I think:-); and, (2) you would
> >> > like to increase doc score for close-by terms only for close-by query
> >> > n-grams.
> >> >
> >> > So there might not be a ready-to-use solution in Lucene for this, but
> >> > perhaps this is a more robust direction to try than the wildcard
> >> > approach - I mean, if users want to type a wildcard query, it is
> >> > their right to do so, but as application logic this does not seem the
> >> > best choice.
