Erick - what about using getSpans() from the SpanQuery that is generated? That should give you what you're after I think.

        Erik
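
A minimal sketch of the counting loop this suggestion implies. In Lucene of this era, SpanQuery.getSpans(IndexReader) returns a Spans object that is walked with next(), doc(), start() and end(). To keep the example runnable without Lucene on the classpath, the SpanCursor interface below is a hypothetical stand-in for that contract, and cursorOver() is an illustrative helper that feeds it in-memory data. It also assumes matches arrive in increasing document order.

```java
public class SpanHitCounter {

    /**
     * Hypothetical stand-in for the iteration contract of Lucene's Spans
     * (next(), doc(), start(), end()), so the sketch runs without Lucene.
     */
    public interface SpanCursor {
        boolean next();   // advance to the next match; false when exhausted
        int doc();        // document (page) number of the current match
        int start();      // position of the first term in the match
        int end();        // one past the position of the last term
    }

    /** Counts how many matches fall in the given document (page). */
    public static int countSpansInDoc(SpanCursor spans, int targetDoc) {
        int count = 0;
        while (spans.next()) {
            if (spans.doc() == targetDoc) {
                count++;
            } else if (spans.doc() > targetDoc) {
                break; // assumption: matches arrive in increasing doc order
            }
        }
        return count;
    }

    /** Illustrative helper: a cursor over in-memory {doc, start, end} triples. */
    public static SpanCursor cursorOver(final int[][] triples) {
        return new SpanCursor() {
            private int i = -1;
            public boolean next() { return ++i < triples.length; }
            public int doc() { return triples[i][0]; }
            public int start() { return triples[i][1]; }
            public int end() { return triples[i][2]; }
        };
    }

    public static void main(String[] args) {
        int[][] matches = { {3, 5, 9}, {7, 0, 4}, {7, 12, 30}, {9, 1, 2} };
        System.out.println(countSpansInDoc(cursorOver(matches), 7)); // prints 2
    }
}
```

With a real Spans object, skipTo(targetDoc) would avoid walking earlier pages. Whether each reported span corresponds to one "hit" in the sense asked about below is a separate question this sketch does not settle.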


On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:


I'm probably close to being able to start work. Except...

How to count hits with SrndQuery? Or, more generally, with arbitrary
wildcards and boolean operators?

So, say I've indexed a book by page. That is, each page is a document. I know a particular page matches my query because the SrndQuery found it. Now, I want to answer the question "How many times did the query match on this
page"?

Take a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?, k? i?f, h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine adding a NOT or three, a nested pair of OR clauses and..... No, DON'T tell me that that
probably wouldn't match any pages anyway <G>....

Anyway, I want to answer "how many times does this occur on the page".
Another way of asking this, I suppose, is "how many terms would be
highlighted on that page", but I don't think highlighting helps. And I'm aware that the question "how many times does 'this' occur" is ambiguous,
especially when we add the not case in......

I can think of a couple of approaches:
1> get down and dirty with the terms. That is, examine the term position vectors and compare all the nasty details of where they occur, combined with, say, RegexTermEnum, and go at it. This is fairly ugly, especially with nested queries. But I can do it, especially if we limit the complexity of the
query or define the hit count more simply.
2> get clever with a regex, fetch the text of the page and see how many
times the regex matches. I'd imagine that the regex will
be...er...unpleasant.
2a> Use simpler regex expressions for each term, assemble the list of match
positions, and count.
2b> Isn't this really just using TermDocs as it was meant to be used,
combined with RegexTermEnum?
2c> Since the number of regex matches on a particular page is much smaller than the number of regex matches over the entire index, does anyone have a feel for whether <2a> or <2b> is easier/faster? For <2a>, I'm analyzing a page with a regex. For <2b>, Lucene has already done the pattern matching,
but I'm reading a bunch of different termdocs......
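
A rough sketch of approach <2a>, under several assumptions: each wildcard term is turned into its own simple regex ('?' as one character, '*' as any run of characters), the page text is tokenized on whitespace and lowercased as a stand-in for whatever analyzer the index really uses, and a "hit" is any token that matches at least one term. The class and method names are made up for illustration. The proximity window (20w), NOT clauses, and nested ORs are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardTermCounter {

    /** Converts a wildcard term ('?' = one char, '*' = any run) to a regex. */
    public static Pattern toPattern(String wildcardTerm) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcardTerm.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote literal characters so regex metacharacters stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the token positions on the page that match any of the terms. */
    public static List<Integer> matchPositions(String pageText, String... wildcardTerms) {
        List<Pattern> patterns = new ArrayList<>();
        for (String term : wildcardTerms) {
            patterns.add(toPattern(term.toLowerCase()));
        }
        // Assumption: whitespace tokenization stands in for the real analyzer.
        String[] tokens = pageText.toLowerCase().trim().split("\\s+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (Pattern p : patterns) {
                if (p.matcher(tokens[i]).matches()) {
                    positions.add(i);
                    break; // count each token once even if several terms match it
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String page = "Send mail to the kind clerk who will mall the code";
        List<Integer> hits = matchPositions(page, "ma?l", "k?nd", "co?e");
        // prints: 4 hits at positions [1, 4, 8, 10]
        System.out.println(hits.size() + " hits at positions " + hits);
    }
}
```

Once the position list exists, a proximity rule could be applied as a second pass over it, which is where the real definition of "hit count" would have to be pinned down.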

Fortunately, for this application, I only care about the hits per page for a single book at a time. I do NOT have to create a list of all hits on all
pages for all books that have any match.

Thanks
Erick

On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending
upon what the PM decides. This app is genealogy research, and users
*can* put in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions
I never would have thought of <G>...

Thanks
Erick

On 10/9/06, Doron Cohen < [EMAIL PROTECTED]> wrote:
>
> "Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> > ... The kicker is that what we are indexing is
> > OCR data, some of which is pretty trashy. So you wind up with
> "interesting"
> > words in your index, things like rtyHrS. So the whole question of
> allowing
> > very specific queries on detailed wildcards (combined with spans) is
> under
> > discussion. It's not at all clear to me that there's any value to the
> end
> > users in the capability of, say, two character prefixes. And, it's an
> easy
> > rule that "prefix queries must specify at least 3 non-wildcard
> > characters"....
>
> Erick, I may be off course here, but, fwiw, have you considered
> n-gram
> indexing/search for a degree of fuzziness to compensate for OCR
> errors..?
>
> For a four-word query you would probably get ~20 tokens (bigrams?) - no
> matter what the index size is. You would then probably want to score
> higher
> by LA (lexical affinity - query terms appear close to each other in the
> document) - and I am not sure to what degree a span query (made of
> n-gram
> terms) would serve that, because (1) all terms in the span need to be
> there
> (well, I think:-); and, (2) you would like to increase doc score for
> close-by terms only for close-by query n-grams.
>
> So there might not be a ready to use solution in Lucene for this, but
> perhaps this is a more robust direction to try than the wild card
> approach
> - I mean, if users want to type a wild card query, it is their right to
> do
> so, but for application logic this does not seem the best choice.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
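
To make the n-gram idea in the quoted message concrete, here is a small sketch of character n-gram tokenization. It assumes character bigrams within each whitespace-separated word, which is only one reading of "bigrams?" in that message; the class and method names are hypothetical. Under that reading, four six-letter query words yield 4 x 5 = 20 bigrams, in line with the ~20 tokens estimated there.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {

    /** Splits text on whitespace and emits the character n-grams of each word. */
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            // Words shorter than n produce no grams in this sketch.
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints: [jo, oh, hn, sm, mi, it, th]
        System.out.println(ngrams("john smith", 2));
    }
}
```

An OCR error in one character disturbs at most n of a word's grams, so the remaining grams can still match; how to score those partial matches is the open question raised in the quoted message.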


