Re: Iterators for collecting Terms from Queries

Tatu Saloranta Wed, 19 Mar 2003 09:47:55 -0800

On Tuesday 18 March 2003 15:54, none none wrote:
...
> > Terms so they can be requested from Query, or returned along with Search
> > results). This would require changes to Query classes however.
>
> Smart! i didn't really think where to put it but i tought would be good
> avoid that because many users do not need the highlight aka termCollector,
> so why force them? In your solution as i said more elegant, the user has to
> decide to do so, that mean in my case set the varable to true. Good idea
> Tatu!!


Thanks. For compatibility, we could still keep a method with the old
signature that calls the new method with 'false' for the new argument.
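
A minimal sketch of that compatibility overload. SketchQuery, search() and
collectedTerms() are made-up names for illustration only, not Lucene's
actual API, and the hit count is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a query class; not Lucene's real API.
class SketchQuery {
    private final String text;
    private final List<String> collected = new ArrayList<String>();

    SketchQuery(String text) {
        this.text = text;
    }

    // Old signature, kept for compatibility: term collection stays off.
    int search() {
        return search(false);
    }

    // New signature: the caller opts in to term collection explicitly.
    int search(boolean collectTerms) {
        collected.clear();
        if (collectTerms) {
            collected.add(text);
        }
        return 1; // placeholder hit count, assumed for the sketch
    }

    // Terms gathered by the most recent search (empty if not requested).
    List<String> collectedTerms() {
        return collected;
    }
}
```

Existing callers of search() see no change; only callers that pass 'true'
pay for the collection.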

> >One problem I tried to solve was that user shouldn't have to know
> > structure of Query classes (that's what visitor pattern in general
...
> >structure, if it only needs terms, not context (ie. need not know which
> > Term came from which query; sometimes this is needed, esp. with phrase
> > queries).
>
> Yes, i didn't explain but what i actully do in my HighLighter class is kind
> like your TermCollector, i put all the terms together. Please Note that i
> add extra information when i collect them, i put the "slop" for example,
> that is because of my Highlight implementation i need to know its value.
> Let's say i do something more then just collect in this class.

Ok, that makes sense.

Also, I started thinking that perhaps combining parts of the two approaches
would make a lot of sense: it would improve the performance of my solution
and generalize yours a bit (ie. there'd be more support from core Lucene
for implementing highlighters).

I think having a term query collector (and a matching iterator) makes sense.
This way all Queries could be easily collected, along with the flags that
BooleanClause has (optional etc). This is fairly easy to do, and doesn't
have too many performance problems. Plus, the caller then need not worry
about the actual Query tree structure: even if new Queries are added, it's
each Query's responsibility to implement that one traversal method.
I also don't think this adds too much clutter to the general code base.
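
Roughly what I have in mind, as a sketch only. QNode, TermNode, BoolNode and
LeafCollector are hypothetical names (none of this exists in Lucene), and
passing just the innermost clause's flags to the collector is a
simplification:

```java
import java.util.ArrayList;
import java.util.List;

// Receives every leaf query plus the flags of the clause that holds it.
interface LeafCollector {
    void collect(QNode leaf, boolean required, boolean prohibited);
}

// Hypothetical query base class: each type implements one traversal method.
abstract class QNode {
    abstract void collectLeaves(LeafCollector c, boolean required, boolean prohibited);
}

// Leaf query: reports itself to the collector.
class TermNode extends QNode {
    final String field;
    final String text;

    TermNode(String field, String text) {
        this.field = field;
        this.text = text;
    }

    void collectLeaves(LeafCollector c, boolean required, boolean prohibited) {
        c.collect(this, required, prohibited);
    }
}

// Composite query: forwards the traversal to its clauses.
class BoolNode extends QNode {
    private static final class Clause {
        final QNode query;
        final boolean required;
        final boolean prohibited;

        Clause(QNode query, boolean required, boolean prohibited) {
            this.query = query;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<Clause>();

    BoolNode add(QNode query, boolean required, boolean prohibited) {
        clauses.add(new Clause(query, required, prohibited));
        return this;
    }

    void collectLeaves(LeafCollector c, boolean required, boolean prohibited) {
        // Simplification: only the innermost clause's flags reach the leaf.
        for (Clause cl : clauses) {
            cl.query.collectLeaves(c, cl.required, cl.prohibited);
        }
    }
}
```

The caller only ever writes a LeafCollector; it never has to test for
BoolNode or any query type added later.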

However, after the queries are collected, it would be possible to access the
collected Terms using the method you implemented, ie. having a method to
access the Terms collected during query execution. The caller can also
choose to do additional query-type-dependent handling if/as necessary at
this point (to access slop, amongst other things?)
So essentially one could traverse all Queries easily, and for each one ask
for all the actual terms, without having to worry about the exact query
type, unless the caller wants to.

Now, for some extra convenience, it would be easy to add simple iterators
over the actual terms. Since the method for accessing collected Terms would
be in the base class, there would be no need for the half a dozen or more
iterator classes I had to add to encapsulate the collection process. But
that would be an optional thing to have.

Finally, a method similar to the one for accessing collected actual terms,
but for accessing base term(s), would be useful. Since there can be up to 2
base terms (for a range query), I'm not sure of the method signature, but
the implementation should be easy to add (perhaps use a signature similar
to many JDK API methods, where an optional Collection is passed, into which
the Term(s) are stored; if null is passed, a new Collection like ArrayList
is created and returned).
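
For example, something along these lines. RangeSketch and getBaseTerms() are
hypothetical names, and plain Strings stand in for Term objects:

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical range query holding up to two base terms.
class RangeSketch {
    private final String lower; // null means open-ended below
    private final String upper; // null means open-ended above

    RangeSketch(String lower, String upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Adds the base terms to 'into' and returns it; if 'into' is null,
    // a new ArrayList is created, filled and returned instead.
    Collection<String> getBaseTerms(Collection<String> into) {
        Collection<String> result = (into != null) ? into : new ArrayList<String>();
        if (lower != null) {
            result.add(lower);
        }
        if (upper != null) {
            result.add(upper);
        }
        return result;
    }
}
```

A caller that walks many queries can pass the same Collection each time and
avoid creating one per query.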

Does this make sense?

> >Like I said above, while you are right that it does have overhead
> > (computing terms twice), I'm not sure how significant that would be in
> > general, compared to search, scoring etc.
> >It would be good to do some simple tests to see if I'm wrong here and Term
> >collection is actually big part of execution time.
>
> I believe, and as you said we could run a test, in WildCard or Prefix query
> this will make a markable difference.

That could be, for big data sets, and prefix/wildcard queries that have lots 
of terms.

Fortunately highlighting is only done for single documents at a time 
(usually?).

Another way around the problem is to start from the highlighted document,
build a (temporary) index, and actually execute the query against just this
single dummy (RAMDirectory based) index (that contains only terms from the
one doc to be highlighted). It would be interesting to see if this might be
a more efficient way to find the actual matched terms.
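
To illustrate the idea without any Lucene classes: in the sketch below a
sorted in-memory set stands in for the temporary RAMDirectory index, and a
prefix lookup stands in for executing the query. SingleDocTerms and
matchPrefix() are made-up names, and the tokenizing is deliberately crude:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Stand-in for a one-document index: just the sorted terms of that document.
class SingleDocTerms {
    private final TreeSet<String> terms = new TreeSet<String>();

    SingleDocTerms(String docText) {
        // Crude tokenizer (an assumption): lower-case, split on non-alphanumerics.
        for (String tok : docText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) {
                terms.add(tok);
            }
        }
    }

    // Returns only those terms of this document that start with the prefix.
    List<String> matchPrefix(String prefix) {
        List<String> out = new ArrayList<String>();
        // Terms sharing a prefix are contiguous in sorted order, so stop
        // at the first term that no longer matches.
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            out.add(t);
        }
        return out;
    }
}
```

The prefix is then expanded only against the terms of the one document,
instead of against every term in the full index.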

> >One other thing I was thinking about was refactoring Range and Prefix
> > queries to be MultiTermQuery - based. I think that should benefit both
> > solutions.
>
> I totally agree with you, also i believe everything can be BooleanQuery and
> MultiTermQuery, TermQuery would be a MultiTermQuery with one term in the
> array, for instance.

I agree: a generic method for accessing/collecting (actual) terms should be
available from any Query (and actually the same goes for base terms).

> >Plus, it seems to me that PhrasePrefixQuery perhaps should just be
> > rewritten. It acts very different from other queries, requiring caller to
...
> I don't really use PhrasePrefixQuery, also because it is not supported by
> the QueryParser, you have to create it, so for now i just avoid to use it.

Yes, I just happened to notice it in the search package; I didn't know such
a thing existed, as the query parser (currently?) has no way to use it. :-)

Of course, having PhrasePrefixQuery, one wonders if it'd make sense to
have PhraseWildcardQuery as well. :-)
(I don't think implementing that would be any more difficult than the prefix
one, but both may be fairly inefficient in some cases)

Thanks for your ideas and suggestions,

-+ Tatu +-

