Re: [htdig-dev] Re: Progress towards 3.1.6

Gilles Detillieux Fri, 18 Jan 2002 09:47:35 -0800

According to Geoff Hutchison:
> At 10:20 AM +0100 1/17/02, J. op den Brouw wrote:
> ><word1> and * not <word2> == <word1> not <word2>
> >
> >Can that be parsed easily?
> 
> Not in the current parser.cc. Perhaps I should rephrase that... I'm 
> not going to write code to deal with that case and certainly not for 
> a "production release."
> 
> If someone else thinks they can tackle that type of query in a 
> reasonable time-frame and can convince us that it doesn't introduce 
> add'l bugs in the query parser, then great. I'm of the opinion that 
> the sooner we can go to a new parser, the better.


Well, what I had envisioned is actually a pretty trivial hook in the
parser.cc code.  It wouldn't optimize the "and" and "or" operations as
Jesse suggested, but it would just treat the * (or actually whatever
prefix_match_character is set to) as a word.  All it would take is a
simple test in Parser::perform_push() to test if the word in "temp"
matches the string in prefix_match_character, and if so, it builds a
dummy ResultList with all valid document IDs.  All it needs is a method
to call to get that list of docIDs - that's the part I need help with.
Apart from that, all it would take is a little hook in setupWords()
to allow a bare prefix_match_character as a word even though it's
shorter than minimum_word_length.
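
A rough sketch of the test I have in mind is below.  It does not use the
real ResultList and DocMatch classes; DocMatchSketch, ResultListSketch and
pushMatchAll() are made-up stand-ins, and the set of all docIDs is simply
passed in, since where that list comes from is exactly the open question.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT the real
// ht://Dig DocMatch / ResultList classes.
struct DocMatchSketch {
    int    id;
    double score;
};
using ResultListSketch = std::map<int, DocMatchSketch>;

// If `word` is the bare prefix match character, fill `out` with one entry
// per valid document and return true.  Otherwise leave `out` untouched and
// return false so the caller can fall through to the normal word lookup.
// `allDocIDs` is assumed to be supplied by the caller.
bool pushMatchAll(const std::string& word,
                  const std::string& prefixMatchChar,
                  const std::set<int>& allDocIDs,
                  double score,
                  ResultListSketch& out)
{
    // An empty prefix_match_character means the feature is switched off.
    if (prefixMatchChar.empty() || word != prefixMatchChar)
        return false;

    for (int id : allDocIDs)
        out[id] = DocMatchSketch{id, score};
    return true;
}
```

The early return keeps the hook from interfering with any ordinary word,
which is the property that would matter most for a production release.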

The only simple technique I can think of to get the list of valid docIDs
into the parser without actually modifying the parser is to put a hook
in htmerge/words.cc to keep track of all the docIDs and then put a
dummy record into db.words.db with a list of cooked-up WordRecords.
That would work, but it's not as clean as I'd like.  A better solution
would be the parser hook I mentioned, but then getting that list of
docIDs would require looking into one of the other two databases, as
you're not going to find it in the word database.

In any case, the eventual outcome must be a ResultList with all valid
docIDs, so I don't think that's any more complicated to patch into the
parser than right into htsearch's main() function.  As Jesse pointed out,
combining * with "and" or "or" doesn't really get you anything, but it might
be nice to be able to do "* not foo".  As for the score field in the
DocMatch objects in the ResultList, they could be assigned something
arbitrarily low, like 1, or text_factor, or text_factor * current->weight.
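
To illustrate what "* not foo" would boil down to, here is a minimal
sketch under the same caveat as before: MatchSketch, ResultsSketch,
matchAll() and applyNot() are hypothetical names, not the parser's actual
code.  Every document starts with the same arbitrarily low score, and the
"not" step just drops whatever matched the right-hand word.

```cpp
#include <map>
#include <set>

// Hypothetical stand-ins; not the real ht://Dig classes.
struct MatchSketch {
    int    id;
    double score;
};
using ResultsSketch = std::map<int, MatchSketch>;

// Build the "match everything" list, giving each document the same
// arbitrarily low score (for example 1, or text_factor).
ResultsSketch matchAll(const std::set<int>& allDocIDs, double lowScore)
{
    ResultsSketch all;
    for (int id : allDocIDs)
        all[id] = MatchSketch{id, lowScore};
    return all;
}

// "left not right": remove from `left` every document found in `right`.
void applyNot(ResultsSketch& left, const ResultsSketch& right)
{
    for (const auto& entry : right)
        left.erase(entry.first);
}
```

With docIDs 1 through 4 and "foo" matching documents 2 and 4, the result
would hold documents 1 and 3, each with the low score they started with.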

As an aside, we've always operated under the assumption that the word
location affected the word score somehow, but I can't find any code in
htsearch that does this.  As far as I can tell, when the info is transferred
from WordRecords to DocMatches, the location field is completely ignored.
Indeed, when I grep for "location" in all the copies of htsearch source
I have (right back to 3.0.8b2), the only reference I find to that word
is in 3.0.8b2's obsolete htsearch/display.cc module, which isn't used
at all, and which contains a section of disabled code that calculates
scores based on word location in DocHead.  I think it's pretty safe to
say the location codes that htdig & htmerge so carefully calculate and
manage are entirely useless.  Am I missing something here?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
