Re: Lucene Highlighter

none none Sun, 02 Mar 2003 23:18:25 -0800

Hi,
after digging through the code a little, I came up with some questions about making 
the highlighter work with the upcoming release of Lucene (1.3).
The questions are:


- Why does PhraseQuery use a Vector while PhrasePrefixQuery uses an ArrayList? Just curious.

- Is it possible to add a method "public Term[] getTermsArray()" that returns the 
"termArrays" from the PhrasePrefixQuery? Is it still populated after we run the search?

- Is it possible to have a PhrasePrefixQuery of 2+ terms, e.g. "Microsoft Soft* Windo*"? 
Why are there two methods, one to add a single term and another to add more than one 
term? Is the termsArray an array of term arrays?
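
To make the "array of term arrays" reading concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: PhraseSlotsSketch and its methods are hypothetical names, and it only models my assumption that each phrase position holds a list of alternative terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: one "slot" per phrase position.
class PhraseSlotsSketch {
    // Each entry lists the alternative terms accepted at that position.
    private final List<String[]> slots = new ArrayList<>();

    // Single-term form: a slot with exactly one alternative.
    void add(String term) {
        add(new String[] { term });
    }

    // Multi-term form: a slot with several alternatives (e.g. Soft* expanded).
    void add(String[] terms) {
        slots.add(terms.clone());
    }

    // True if some run of consecutive tokens matches every slot in order.
    boolean matches(List<String> tokens) {
        int n = slots.size();
        if (n == 0) {
            return false;
        }
        for (int start = 0; start + n <= tokens.size(); start++) {
            boolean ok = true;
            for (int slot = 0; slot < n && ok; slot++) {
                ok = Arrays.asList(slots.get(slot)).contains(tokens.get(start + slot));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

Under this reading, "Microsoft Soft* Windo*" would be one single-term slot followed by two multi-term slots, which would explain why both add methods exist.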

- Is it correct that PrefixQuery.rewrite(...) is called by the searcher (or the reader?) 
at search time to produce a BooleanQuery with an "OR" condition between the clauses? 
Does each clause hold a TermQuery?
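
This is how I picture the expansion, as a plain-Java sketch with no Lucene classes. The sorted set stands in for the index's term dictionary, and PrefixExpansionSketch, expand and describe are hypothetical names; the real rewrite may well work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical model, not Lucene code: expands a prefix into one
// optional ("OR") clause per matching indexed term.
class PrefixExpansionSketch {

    static List<String> expand(SortedSet<String> termDictionary, String prefix) {
        List<String> clauses = new ArrayList<>();
        // tailSet starts at the first term >= prefix; because the set is
        // sorted, all terms sharing the prefix are contiguous from there.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term with this prefix
            }
            clauses.add(term); // each term would become one term clause
        }
        return clauses;
    }

    // Renders the expansion the way an OR query might be printed.
    static String describe(List<String> clauses) {
        return String.join(" OR ", clauses);
    }
}
```

If this picture is right, the expanded clauses are exactly the terms the highlighter needs.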

- PrefixQuery: what do you think of this scenario? The user calls "populateTermArray()" 
before running the search. This sets a static variable inside the Query class, so the 
setting applies to all the XxxQuery classes. In the 'rewrite' method we check this value 
and, if it is true (the default is false), we store each term in an array 'termsArray', 
one per implementation (wildcard, etc.). When we need to highlight, we call 
getTermsArray() on each query according to its instance type (again: wildcard, etc.), 
and afterwards we either set the array to null or wait for the garbage collector to 
release it. Sounds good?
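
A rough sketch of the scenario above, in plain Java with no Lucene classes. SketchQuery, SketchPrefixQuery and releaseTermsArray are hypothetical names, and rewrite here just filters a list of indexed terms instead of producing a real query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shared Query base class.
abstract class SketchQuery {
    // Global switch as proposed: off by default, shared by all query types.
    private static boolean populateTermArray = false;

    static void setPopulateTermArray(boolean on) {
        populateTermArray = on;
    }

    static boolean isPopulateTermArray() {
        return populateTermArray;
    }
}

// Hypothetical stand-in for PrefixQuery.
class SketchPrefixQuery extends SketchQuery {
    private final String prefix;
    private List<String> termsArray; // null until a rewrite collects terms

    SketchPrefixQuery(String prefix) {
        this.prefix = prefix;
    }

    // Models rewrite(): visits the indexed terms and returns the matches.
    List<String> rewrite(List<String> indexedTerms) {
        List<String> matched = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                matched.add(term);
            }
        }
        if (isPopulateTermArray()) {
            termsArray = new ArrayList<>(matched); // keep a copy for highlighting
        }
        return matched;
    }

    // Returns the collected terms, or an empty array if none were stored.
    String[] getTermsArray() {
        return termsArray == null ? new String[0] : termsArray.toArray(new String[0]);
    }

    // Drops the stored terms once highlighting is done.
    void releaseTermsArray() {
        termsArray = null;
    }
}
```

One thing I am unsure about: a static flag is shared by every search running in the same JVM, so a per-query setting might be the safer choice.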

- PrefixQuery and the other query classes that have this 'rewrite' method: can the 
method be called more than once at search time? If so, we should keep the previous 
array of terms and add the new terms to it without duplicates.
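
If rewrite can indeed run more than once, merging without duplicates could look like the sketch below. It is plain Java with no Lucene classes, and TermAccumulatorSketch is a hypothetical name.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Lucene code: merges the terms from several
// rewrite calls into one duplicate-free collection.
class TermAccumulatorSketch {
    // LinkedHashSet drops duplicates and keeps first-seen order.
    private final Set<String> terms = new LinkedHashSet<>();

    // Would be called at the end of each rewrite with that call's terms.
    void addAll(List<String> termsFromOneRewrite) {
        terms.addAll(termsFromOneRewrite);
    }

    // Snapshot of everything collected so far.
    String[] getTermsArray() {
        return terms.toArray(new String[0]);
    }

    // Releases the collected terms after highlighting.
    void clear() {
        terms.clear();
    }
}
```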

- RangeQuery: can we apply the same approach as for PrefixQuery?

- All the classes that extend MultiTermQuery: can we apply the same approach as for 
PrefixQuery? (As above: add a vector that holds the terms if the user asks for it, read 
this array when highlighting, and possibly call a method to release the resource once 
the highlighting is done.)

- How can we get the position of a particular term in a particular document in the 
index? This would greatly simplify finding the start and end offsets of a term in a 
document. I assume that a text version of the field to highlight is available, e.g. 
the content of an HTML page is a field and is stored in a single text file. It would 
also keep the highlighter consistent with the tokenizer, since we would use the same 
one as at indexing time and would not have to write a RegExp pattern for each kind of 
query (in fact, the patterns would no longer be necessary!).
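
What I have in mind is sketched below in plain Java, with no Lucene classes. OffsetLookupSketch is a hypothetical name, and its tokenize method is only a stand-in for whatever analyzer was used at indexing time. It also assumes positions are 0-based and advance by one per token (no stop-word removal).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model, not Lucene code: maps token positions back to
// character offsets by re-tokenizing the stored text.
class OffsetLookupSketch {

    // One token with its character span in the original text.
    static final class Token {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in analyzer: lower-cases and splits on non-alphanumerics.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++; // consume one token
            }
            if (i > start) {
                String word = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(word, start, i));
            }
        }
        return tokens;
    }

    // Returns {start, end} for each requested position that exists.
    static List<int[]> offsetsFor(String text, int[] positions) {
        List<Token> tokens = tokenize(text);
        List<int[]> spans = new ArrayList<>();
        for (int p : positions) {
            if (p >= 0 && p < tokens.size()) {
                Token t = tokens.get(p);
                spans.add(new int[] { t.start, t.end });
            }
        }
        return spans;
    }
}
```

With the positions of a term in hand, the highlighter would only have to wrap the returned spans, whatever the query type was.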

- Would all these changes slow down the search process? As a rough guess, by how much?

- Would the term position call be slow?
 
Thank you guys.


