On Tuesday 18 March 2003 15:54, none none wrote:
...
> > Terms so they can be requested from Query, or returned along with Search
> > results). This would require changes to Query classes however.
>
> Smart! I didn't really think about where to put it, but I thought it would
> be good to avoid that because many users do not need the highlight aka
> termCollector, so why force them? In your solution, which as I said is more
> elegant, the user has to decide to do so, which in my case means setting
> the variable to true. Good idea Tatu!!

Thanks. We could still keep (for compatibility) a method with the old
signature that calls the 'new' method with 'false' for the new argument.

> >One problem I tried to solve was that user shouldn't have to know
> > structure of Query classes (that's what visitor pattern in general
...
> >structure, if it only needs terms, not context (ie. need not know which
> > Term came from which query; sometimes this is needed, esp. with phrase
> > queries).
>
> Yes, I didn't explain, but what I actually do in my HighLighter class is
> kind of like your TermCollector: I put all the terms together. Please note
> that I add extra information when I collect them; I put the "slop" for
> example, because my Highlight implementation needs to know its value.
> Let's say I do something more than just collect in this class.

Ok, that makes sense. Also, I started thinking that perhaps combining parts
of the two approaches would make lots of sense, improving performance of my
solution and generalizing your solution a bit (ie. there'd be more support
from core Lucene for implementing highlighters)?

I think having a term query collector (and matching iterator) makes sense.
This way all Queries could be easily collected, along with the flags that
BooleanClause has (optional etc). This is fairly easy to do, and doesn't
have too many performance problems. Plus, the caller then need not worry
about the actual Query tree structure; even if new Queries are added, it's
the Query's responsibility to add that one traversal method implementation.
I also don't think this adds too much clutter to the general code base.

However, after queries are collected, it would be possible to access the
collected Terms using the method you implemented, ie. having a method to
access Terms collected during query execution. The caller can also choose
to do additional query type dependent handling if/as necessary at this
point (to access slop, amongst other things?).

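To make the collector idea above a bit more concrete, here is a minimal sketch. All class and method names in it (SketchQuery, QueryCollector, collect(), collectedTerms() and so on) are hypothetical stand-ins, not existing Lucene classes. Terms are plain Strings, and the required/prohibited flags are only modeled loosely on what BooleanClause carries. Skipping prohibited clauses is an assumption about what a highlighter would want.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the Lucene classes.
abstract class SketchQuery {
    // The one traversal method each query type implements.
    abstract void collect(QueryCollector collector, boolean required, boolean prohibited);

    // Terms this query contributes (in the proposal: terms gathered during execution).
    abstract List<String> collectedTerms();
}

// One collected query plus the clause flags it was reached with.
final class CollectedQuery {
    final SketchQuery query;
    final boolean required;
    final boolean prohibited;

    CollectedQuery(SketchQuery query, boolean required, boolean prohibited) {
        this.query = query;
        this.required = required;
        this.prohibited = prohibited;
    }
}

final class QueryCollector {
    private final List<CollectedQuery> entries = new ArrayList<CollectedQuery>();

    void add(SketchQuery query, boolean required, boolean prohibited) {
        entries.add(new CollectedQuery(query, required, prohibited));
    }

    List<CollectedQuery> entries() { return entries; }
}

final class SketchTermQuery extends SketchQuery {
    private final String term;

    SketchTermQuery(String term) { this.term = term; }

    void collect(QueryCollector c, boolean required, boolean prohibited) {
        c.add(this, required, prohibited);
    }

    List<String> collectedTerms() { return Collections.singletonList(term); }
}

final class SketchPhraseQuery extends SketchQuery {
    private final List<String> terms;
    final int slop; // extra context a highlighter may want to read

    SketchPhraseQuery(List<String> terms, int slop) {
        this.terms = terms;
        this.slop = slop;
    }

    void collect(QueryCollector c, boolean required, boolean prohibited) {
        c.add(this, required, prohibited);
    }

    List<String> collectedTerms() { return terms; }
}

final class SketchBooleanQuery extends SketchQuery {
    private final List<CollectedQuery> clauses = new ArrayList<CollectedQuery>();

    SketchBooleanQuery add(SketchQuery query, boolean required, boolean prohibited) {
        clauses.add(new CollectedQuery(query, required, prohibited));
        return this;
    }

    // A container only forwards to its clauses, passing each clause's own flags.
    // Simplification: flags of an enclosing clause are not combined with nested ones.
    void collect(QueryCollector c, boolean required, boolean prohibited) {
        for (CollectedQuery clause : clauses) {
            clause.query.collect(c, clause.required, clause.prohibited);
        }
    }

    List<String> collectedTerms() { return Collections.emptyList(); }
}

class CollectorDemo {
    // Flattens a query tree into terms without the caller knowing the tree structure.
    static List<String> termsOf(SketchQuery root) {
        QueryCollector collector = new QueryCollector();
        root.collect(collector, false, false);
        List<String> out = new ArrayList<String>();
        for (CollectedQuery e : collector.entries()) {
            if (e.prohibited) continue; // assumption: prohibited clauses are not highlighted
            out.addAll(e.query.collectedTerms());
        }
        return out;
    }

    public static void main(String[] args) {
        SketchQuery root = new SketchBooleanQuery()
            .add(new SketchTermQuery("lucene"), true, false)
            .add(new SketchPhraseQuery(java.util.Arrays.asList("term", "collector"), 2), false, false)
            .add(new SketchTermQuery("spam"), false, true);
        System.out.println(termsOf(root));
    }
}
```

A caller that needs type-specific details could still check each collected entry (for example, `instanceof SketchPhraseQuery` to read `slop`), but would not have to.
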
So essentially one could traverse all Queries easily, and for each one ask
for all the actual terms, without having to worry about the exact query
type, unless one wants to.

Now, for some extra convenience, it would be easy to add simple iterators
over actual terms. Since the method for accessing collected Terms would be
in the base class, there would be no need for the half a dozen or more
iterator classes I had to add to encapsulate the collection process. But
that would be an optional thing to have.

Finally, a method similar to the one for accessing collected actual terms,
but for accessing base term(s), would be useful. Since there can be up to 2
base terms (for range query), I'm not sure of the method signature, but the
implementation should be easy to add (perhaps use a signature similar to
many JDK API methods, where an optional Collection is passed, into which
the Term(s) are stored; if null is passed, a new Collection like ArrayList
is created and returned).

Does this make sense?

> >Like I said above, while you are right that it does have overhead
> > (computing terms twice), I'm not sure how significant that would be in
> > general, compared to search, scoring etc.
> >It would be good to do some simple tests to see if I'm wrong here and Term
> >collection is actually big part of execution time.
>
> I believe, and as you said we could run a test, that in a WildCard or
> Prefix query this will make a marked difference.

That could be, for big data sets and prefix/wildcard queries that have lots
of terms. Fortunately highlighting is only done for a single document at a
time (usually?).

Another way around the problem is to start from the document to be
highlighted, build a (temporary) index, and actually execute the query
against just this single dummy (RAMDirectory based) index (which contains
only terms from that one doc). It would be interesting to see if this might
be a more efficient way to find the actual matched terms.

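Coming back to the base-term accessor suggested above, the "optional Collection" signature could look roughly like the following sketch. The class names (BaseTermQuery, SketchPrefixQuery, SketchRangeQuery) and the method name getBaseTerms() are hypothetical, terms are plain Strings rather than Lucene Term objects, and allowing a null bound on the range query is an assumption made to show the "up to 2" case.

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-ins for illustration; these are NOT the Lucene classes.
abstract class BaseTermQuery {
    /**
     * Stores this query's base term(s) into 'into' and returns it.
     * If 'into' is null, a new ArrayList is created and returned.
     */
    abstract Collection<String> getBaseTerms(Collection<String> into);

    // Shared helper so every subclass handles the null case the same way.
    static Collection<String> orNew(Collection<String> into) {
        return (into != null) ? into : new ArrayList<String>();
    }
}

final class SketchPrefixQuery extends BaseTermQuery {
    private final String prefix;

    SketchPrefixQuery(String prefix) { this.prefix = prefix; }

    // A prefix query has exactly one base term: the prefix itself.
    Collection<String> getBaseTerms(Collection<String> into) {
        Collection<String> out = orNew(into);
        out.add(prefix);
        return out;
    }
}

final class SketchRangeQuery extends BaseTermQuery {
    private final String lower; // assumption: null means open-ended
    private final String upper; // assumption: null means open-ended

    SketchRangeQuery(String lower, String upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // A range query has up to two base terms, hence the Collection-based signature.
    Collection<String> getBaseTerms(Collection<String> into) {
        Collection<String> out = orNew(into);
        if (lower != null) out.add(lower);
        if (upper != null) out.add(upper);
        return out;
    }
}

class BaseTermsDemo {
    public static void main(String[] args) {
        // Passing null: the query allocates the result itself.
        System.out.println(new SketchRangeQuery("apple", "banana").getBaseTerms(null));

        // Passing a Collection: several queries append into the same one.
        Collection<String> shared = new ArrayList<String>();
        new SketchPrefixQuery("luc").getBaseTerms(shared);
        new SketchRangeQuery(null, "zebra").getBaseTerms(shared);
        System.out.println(shared);
    }
}
```

The benefit of this shape is that a caller walking many queries can reuse one Collection, while a caller that only wants the terms of a single query can pass null and use the returned value.
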
> >One other thing I was thinking about was refactoring Range and Prefix
> > queries to be MultiTermQuery - based. I think that should benefit both
> > solutions.
>
> I totally agree with you, also I believe everything can be BooleanQuery and
> MultiTermQuery, TermQuery would be a MultiTermQuery with one term in the
> array, for instance.

I agree, a generic (actual) term access/collecting method should be
available from any Query (and actually the same goes for base terms).

> >Plus, it seems to me that PhrasePrefixQuery perhaps should just be
> > rewritten. It acts very different from other queries, requiring caller to
...
> I don't really use PhrasePrefixQuery, also because it is not supported by
> the QueryParser, you have to create it, so for now I just avoid using it.

Yes, I just happened to notice it in the search package; I didn't know such
a thing existed, as the query parser has (currently?) no way to use it. :-)

Of course, having PhrasePrefixQuery, one wonders if it'd make sense to have
PhraseWildcardQuery as well. :-)
(I don't think implementing that would be any more difficult than the prefix
one, but both may be fairly inefficient in some cases)

Thanks for your ideas and suggestions,

-+ Tatu +-
