Erick - what about using getSpans() from the SpanQuery that is generated? That should give you what you're after, I think.
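
Something along these lines, maybe (an untested sketch; assumes 'reader' is your IndexReader, 'spanQuery' is the generated SpanQuery, and 'pageDocId' is the document number of the page you care about):

    Spans spans = spanQuery.getSpans(reader);
    int count = 0;
    while (spans.next()) {              // one step per span occurrence
        if (spans.doc() == pageDocId) {
            count++;                    // tally occurrences on that page
        }
    }

Each next() advances to the next span occurrence, not the next document, so the tally is per-occurrence. If you only care about one page, spans.skipTo(pageDocId) will get you there faster.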
Erik
On Oct 11, 2006, at 2:17 PM, Erick Erickson wrote:
Problem 3482:
I'm probably close to being able to start work. Except...
How to count hits with SrndQuery? Or, more generally, with arbitrary wildcards and boolean operators?

So, say I've indexed a book by page. That is, each page is a document. I know a particular page matches my query because the SrndQuery found it. Now, I want to answer the question "How many times did the query match on this page?"
Take a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?, k?i?f, h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine adding a NOT or three, a nested pair of OR clauses, and..... No, DON'T tell me that that probably wouldn't match any pages anyway <G>....
Anyway, I want to answer "how many times does this occur on the page?" Another way of asking this, I suppose, is "how many terms would be highlighted on that page?", but I don't think highlighting helps. And I'm aware that the question "how many times does 'this' occur?" is ambiguous, especially when we add the NOT case in......
I can think of a couple of approaches:

1> Get down and dirty with the terms. That is, examine the term position vectors and compare all the nasty details of where they occur, combined with, say, RegexTermEnum, and go at it. This is fairly ugly, especially with nested queries. But I can do it, especially if we limit the complexity of the query or define the hit count more simply. (See the first sketch after this list.)

2> Get clever with a regex, fetch the text of the page, and see how many times the regex matches. I'd imagine that the regex will be...er...unpleasant.

2a> Use simpler regex expressions for each term, assemble the list of match positions, and count. (See the second sketch after this list.)

2b> Isn't this really just using TermDocs as it was meant to be used, combined with RegexTermEnum?

2c> Since the number of regex matches on a particular page is much smaller than the number of regex matches over the entire index, does anyone have a feel for whether <2a> or <2b> is easier/faster? For <2a>, I'm analyzing a page with a regex. For <2b>, Lucene has already done the pattern matching, but I'm reading a bunch of different TermDocs......
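
For <1>, I'm picturing something like this (untested, hypothetical sketch; assumes the page field was indexed with term vectors and positions, "contents" is the field name, and matchesWildcard() stands in for whatever pattern test we settle on - none of those are givens):

    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "contents");
    String[] terms = tpv.getTerms();
    int count = 0;
    for (int i = 0; i < terms.length; i++) {
        if (matchesWildcard(terms[i])) {           // hypothetical pattern test
            count += tpv.getTermPositions(i).length; // one position per occurrence
        }
    }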
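And for <2a>, roughly (again untested; assumes the page text is stored in that same hypothetical "contents" field, and shows only the first term of the example query with '?' translated to '.'):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String pageText = reader.document(docId).get("contents");
    Matcher m = Pattern.compile("\\be.ci\\b").matcher(pageText);
    int count = 0;
    while (m.find()) {
        count++;   // one hit per regex match on the page
    }

The real thing would need a pattern per term and would record m.start() for each match so the position lists can be merged.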
Fortunately, for this application, I only care about the hits per page for a single book at a time. I do NOT have to create a list of all hits on all pages for all books that have any match.
Thanks
Erick
On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon what the PM decides. This app is genealogy research, and users *can* put in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I never would have thought of <G>...

Thanks
Erick
On 10/9/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
>
> "Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> > ... The kicker is that what we are indexing is OCR data, some of which
> > is pretty trashy. So you wind up with "interesting" words in your
> > index, things like rtyHrS. So the whole question of allowing very
> > specific queries on detailed wildcards (combined with spans) is under
> > discussion. It's not at all clear to me that there's any value to the
> > end users in the capability of, say, two-character prefixes. And, it's
> > an easy rule that "prefix queries must specify at least 3 non-wildcard
> > characters"....
>
> Erick, I may be off course here, but, FWIW, have you considered n-gram
> indexing/search for a degree of fuzziness, to compensate for OCR
> errors..?
>
> For a four-word query you would probably get ~20 tokens (bigrams?) - no
> matter what the index size is. You would then probably want to score
> higher by LA (lexical affinity - query terms appearing close to each
> other in the document) - and I am not sure to what degree a span query
> (made of n-gram terms) would serve that, because (1) all terms in the
> span need to be there (well, I think :-); and (2) you would like to
> increase the doc score for close-by terms only for close-by query
> n-grams.
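>
> A rough illustration of the tokenization I mean (hypothetical sketch;
> in practice this would live in a custom TokenFilter applied at both
> index and query time):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Split one token into character bigrams,
>     // e.g. "maple" -> ["ma", "ap", "pl", "le"].
>     static List bigrams(String term) {
>         List grams = new ArrayList();
>         for (int i = 0; i + 2 <= term.length(); i++) {
>             grams.add(term.substring(i, i + 2));
>         }
>         return grams;
>     }
>
> An OCR error like "mapie" still shares "ma" and "ap" with "maple",
> which is where the fuzziness would come from.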
>
> So there might not be a ready-to-use solution in Lucene for this, but
> perhaps this is a more robust direction to try than the wildcard
> approach - I mean, if users want to type a wildcard query, it is their
> right to do so, but for the application logic this does not seem the
> best choice.