Erick - what about using getSpans() from the SpanQuery that is generated? That should give you what you're after, I think.
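
Something along these lines, maybe (an untested sketch; assumes 'reader' is your IndexReader, 'spanQuery' is the generated SpanQuery, and 'pageDocId' is the document number of the page you care about):

    Spans spans = spanQuery.getSpans(reader);
    int count = 0;
    while (spans.next()) {              // one step per span occurrence
        if (spans.doc() == pageDocId) {
            count++;                    // tally occurrences on that page
        }
    }

Each next() advances to the next span occurrence, not the next document, so the tally is per-occurrence. If you only care about one page, spans.skipTo(pageDocId) will get you there faster.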
Erik
On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:
Problem 3482:
I'm probably close to being able to start work. Except...
How to count hits with SrndQuery? Or, more generally, with arbitrary wildcards and boolean operators?

So, say I've indexed a book by page. That is, each page is a document. I know a particular page matches my query because the SrndQuery found it. Now, I want to answer the question "How many times did the query match on this page?"
Take a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?, k?i?f, h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine adding a NOT or three, a nested pair of OR clauses, and..... No, DON'T tell me that that probably wouldn't match any pages anyway <G>....
Anyway, I want to answer "how many times does this occur on the page?" Another way of asking this, I suppose, is "how many terms would be highlighted on that page?", but I don't think highlighting helps. And I'm aware that the question "how many times does 'this' occur?" is ambiguous, especially when we add the NOT case in......
I can think of a couple of approaches:

1> Get down and dirty with the terms. That is, examine the term position vectors and compare all the nasty details of where they occur, combined with, say, RegexTermEnum, and go at it. This is fairly ugly, especially with nested queries. But I can do it, especially if we limit the complexity of the query or define the hit count more simply. (See the first sketch after this list.)

2> Get clever with a regex, fetch the text of the page, and see how many times the regex matches. I'd imagine that the regex will be...er...unpleasant.

2a> Use simpler regex expressions for each term, assemble the list of match positions, and count. (See the second sketch after this list.)

2b> Isn't this really just using TermDocs as it was meant to be used, combined with RegexTermEnum?

2c> Since the number of regex matches on a particular page is much smaller than the number of regex matches over the entire index, does anyone have a feel for whether <2a> or <2b> is easier/faster? For <2a>, I'm analyzing a page with a regex. For <2b>, Lucene has already done the pattern matching, but I'm reading a bunch of different TermDocs......
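
For <1>, I'm picturing something like this (untested, hypothetical sketch; assumes the page field was indexed with term vectors and positions, "contents" is the field name, and matchesWildcard() stands in for whatever pattern test we settle on - none of those are givens):

    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "contents");
    String[] terms = tpv.getTerms();
    int count = 0;
    for (int i = 0; i < terms.length; i++) {
        if (matchesWildcard(terms[i])) {           // hypothetical pattern test
            count += tpv.getTermPositions(i).length; // one position per occurrence
        }
    }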
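And for <2a>, roughly (again untested; assumes the page text is stored in that same hypothetical "contents" field, and shows only the first term of the example query with '?' translated to '.'):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String pageText = reader.document(docId).get("contents");
    Matcher m = Pattern.compile("\\be.ci\\b").matcher(pageText);
    int count = 0;
    while (m.find()) {
        count++;   // one hit per regex match on the page
    }

The real thing would need a pattern per term and would record m.start() for each match so the position lists can be merged.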
Fortunately, for this application, I only care about the hits per page for a single book at a time. I do NOT have to create a list of all hits on all pages for all books that have any match.
Thanks
Erick
On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon what the PM decides. This app is genealogy research, and users *can* put in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I never would have thought of <G>...

Thanks
Erick
On 10/9/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
>
> "Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> > ... The kicker is that what we are indexing is OCR data, some of which
> > is pretty trashy. So you wind up with "interesting" words in your
> > index, things like rtyHrS. So the whole question of allowing very
> > specific queries on detailed wildcards (combined with spans) is under
> > discussion. It's not at all clear to me that there's any value to the
> > end users in the capability of, say, two-character prefixes. And, it's
> > an easy rule that "prefix queries must specify at least 3 non-wildcard
> > characters"....
>
> Erick, I may be off course here, but, FWIW, have you considered n-gram
> indexing/search for a degree of fuzziness, to compensate for OCR
> errors..?
>
> For a four-word query you would probably get ~20 tokens (bigrams?) - no
> matter what the index size is. You would then probably want to score
> higher by LA (lexical affinity - query terms appearing close to each
> other in the document) - and I am not sure to what degree a span query
> (made of n-gram terms) would serve that, because (1) all terms in the
> span need to be there (well, I think :-); and (2) you would like to
> increase the doc score for close-by terms only for close-by query
> n-grams.
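>
> A rough illustration of the tokenization I mean (hypothetical sketch;
> in practice this would live in a custom TokenFilter applied at both
> index and query time):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Split one token into character bigrams,
>     // e.g. "maple" -> ["ma", "ap", "pl", "le"].
>     static List bigrams(String term) {
>         List grams = new ArrayList();
>         for (int i = 0; i + 2 <= term.length(); i++) {
>             grams.add(term.substring(i, i + 2));
>         }
>         return grams;
>     }
>
> An OCR error like "mapie" still shares "ma" and "ap" with "maple",
> which is where the fuzziness would come from.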
>
> So there might not be a ready-to-use solution in Lucene for this, but
> perhaps this is a more robust direction to try than the wildcard
> approach - I mean, if users want to type a wildcard query, it is their
> right to do so, but for the application logic this does not seem the
> best choice.