According to Geoff Hutchison:
> At 10:20 AM +0100 1/17/02, J. op den Brouw wrote:
> ><word1> and * not <word2> == <word1> not <word2>
> >
> >Can that be parsed easily?
>
> Not in the current parser.cc. Perhaps I should rephrase that... I'm
> not going to write code to deal with that case and certainly not for
> a "production release."
>
> If someone else thinks they can tackle that type of query in a
> reasonable time-frame and can convince us that it doesn't introduce
> add'l bugs in the query parser, then great. I'm of the opinion that
> the sooner we can go to a new parser, the better.

Well, what I had envisioned is actually a pretty trivial hook in the parser.cc code. It wouldn't optimize the "and" and "or" operations as Jesse suggested, but it would just treat the * (or actually whatever prefix_match_character is set to) as a word. All it would take is a simple test in Parser::perform_push() to check whether the word in "temp" matches the string in prefix_match_character, and if so, build a dummy ResultList with all valid document IDs. All it needs is a method to call to get that list of docIDs - that's the part I need help with. Apart from that, all it would take is a little hook in setupWords() to allow a bare prefix_match_character as a word even though it's shorter than minimum_word_length.

The only simple technique I can think of to get the list of valid docIDs into the parser without actually modifying the parser is to put a hook in htmerge/words.cc to keep track of all the docIDs and then put a dummy record into db.words.db with a list of cooked-up WordRecords. That would work, but it's not as clean as I'd like. A better solution would be the parser hook I mentioned, but then getting that list of docIDs would require looking into one of the other two databases, as you're not going to find it in the word database. In any case, the eventual outcome must be a ResultList with all valid docIDs, so I don't think that's any more complicated to patch into the parser than right into htsearch's main() function.

As Jesse pointed out, combining * with "and" or "or" doesn't really get you anything, but it might be nice to be able to do "* not foo". As for the score field in the DocMatch objects in the ResultList, it could be assigned something arbitrarily low, like 1, or text_factor, or text_factor * current->weight.

As an aside, we've always operated under the assumption that the word location affected the word score somehow, but I can't find any code in htsearch that does this.
As far as I can tell, when the info is transferred from WordRecords to DocMatches, the location field is completely ignored. Indeed, when I grep for "location" in all the copies of htsearch source I have (right back to 3.0.8b2), the only reference I find to that word is in 3.0.8b2's obsolete htsearch/display.cc module, which isn't used at all, and which contains a section of disabled code that calculates scores based on word location in DocHead. I think it's pretty safe to say the location codes that htdig & htmerge so carefully calculate and manage are entirely useless. Am I missing something here?

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
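
For illustration, the parser hook described in the message above could look roughly like the sketch below. It is only a toy model under stated assumptions: DocMatch and ResultList here are simplified stand-ins, not the real ht://Dig classes, and makeAllDocsResult(), pushWord(), applyNot() and the wordIndex map are hypothetical names invented for this example. How the set of valid docIDs is actually obtained is left open, exactly as in the message; the sketch simply takes it as a parameter.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-ins for illustration only; NOT the real ht://Dig classes.
struct DocMatch {
    int id;
    double score;
};

using ResultList = std::map<int, DocMatch>;  // keyed by document ID

// Hypothetical helper: build a dummy ResultList holding every valid docID,
// each with the same arbitrarily low score.
ResultList makeAllDocsResult(const std::set<int>& validDocIDs, double lowScore) {
    ResultList all;
    for (int id : validDocIDs) {
        all.emplace(id, DocMatch{id, lowScore});
    }
    return all;
}

// Hypothetical analogue of the test proposed for perform_push(): a bare
// prefix-match character is treated as a word that matches every document.
// wordIndex stands in for the normal word-database lookup.
ResultList pushWord(const std::string& word,
                    const std::string& prefixMatchCharacter,
                    const std::set<int>& validDocIDs,
                    const std::map<std::string, ResultList>& wordIndex) {
    assert(!prefixMatchCharacter.empty());
    if (word == prefixMatchCharacter) {
        return makeAllDocsResult(validDocIDs, 1.0);  // arbitrary low score
    }
    auto it = wordIndex.find(word);
    return it == wordIndex.end() ? ResultList{} : it->second;
}

// "left not right": keep the documents in left that do not appear in right.
ResultList applyNot(const ResultList& left, const ResultList& right) {
    ResultList out;
    for (const auto& entry : left) {
        if (right.find(entry.first) == right.end()) {
            out.insert(entry);
        }
    }
    return out;
}
```

With this model, a query like "* not foo" becomes applyNot(pushWord("*", ...), pushWord("foo", ...)), which yields every valid document that does not contain "foo", each carrying the low placeholder score.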