I'll preface this by saying that I'm addressing your comments for the 3.2 code--most things you mention just can't happen with 3.1.x.
-Geoff

At 1:12 PM +1300 1/3/02, Jamie Anstice wrote:
>phrase searching where if the phrase being queried contains a
>stop-word then it doesn't match even when there's a match in the
>database.

Yeah, this is a known bug. The fix isn't as bad as you make it out to be if you allow for false positive matches. You do the query, keeping track of the word offsets in the phrase query itself. So "Foo and Bar Esquire" would search for Bar as +2 and Esquire as +3 relative to Foo. Some cases will match with some other word in place of the "and," but you certainly won't miss any.

> * handling of meta-tags is a bit lacking - it would be good to be able to
>surface metadata in the search results, and be search on the contents of
>meta-data fields.

Sure. We need the new htsearch framework in place before this can happen, though. Eventually it would be nice to allow users to search on customized meta tags. The 3.2 database framework makes this possible, but it is not implemented yet. I'd be glad to talk to people about how the document parsers would need to change to do this.

> * The searching (in 3.2x at least) seems to unduly favour long documents
>over short ones, and common words over less common words. Worse, in
>multi-word queries, the ranking of results is skewed towards the most
>common term

Yes. Working on these sorts of scoring issues will also have to wait until the new htsearch code is in place. As you mention, messing with the old parser.cc is horrible.

> * The HTTP 1.1 persistent connection code is a little conservative - if

Does the server seem to close the connection after any sort of 404?

>thought about making some general string list version, but then I decided
>that it would be too much trouble keeping track of which of the strings
>was active at any one time, and I decided that 3 options would be enough

Sounds like you want a stack to me. If the pattern matches one of the items in the string list, it's put on the stack and then removed as appropriate.
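To make the offset idea for phrases containing stop words a bit more concrete, here is a rough Python sketch. It is not ht://Dig code: the stop list, the function names, and the toy in-memory index are all made up for illustration, and it assumes the index records word positions that still count the stop words that were skipped.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"and", "the", "of", "a"}

def phrase_offsets(phrase):
    """Return (word, offset) pairs for the non-stop words of a phrase.

    Offsets are relative to the first non-stop word and still count
    the stop words that were dropped.
    """
    words = phrase.lower().split()
    kept = [(i, w) for i, w in enumerate(words) if w not in STOP_WORDS]
    if not kept:
        return []
    base = kept[0][0]
    return [(w, i - base) for i, w in kept]

def build_index(text):
    """Toy index: non-stop word -> set of positions (stop words keep their slot)."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(pos)
    return index

def phrase_matches(index, phrase):
    """Return start positions where every non-stop word sits at its offset."""
    terms = phrase_offsets(phrase)
    if not terms:
        return []
    first_word = terms[0][0]
    return sorted(
        start
        for start in index.get(first_word, set())
        if all(start + off in index.get(word, set()) for word, off in terms[1:])
    )

index = build_index("Foo and Bar Esquire met Foo or Bar Esquire")
print(phrase_offsets("Foo and Bar Esquire"))
print(phrase_matches(index, "Foo and Bar Esquire"))
```

The second hit in that example ("Foo or Bar Esquire") is exactly the kind of false positive mentioned above: whatever word sits in the stop word's slot is never checked, so nothing real is missed but a few extra matches get through.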
> * It would be nice to get the latest database code integrated

In *theory* this shouldn't be so bad. But I haven't had the time to sift through the mifluz code and figure out how the word database works now. It sounds like words are indexed by "word ids" rather than the words themselves, but that's as far as I got.

> * I'm not so sure that the new parsing scheme of calculating the score at
>search time is such a good idea ... it is much slower than the 3.1
>series.

You'd think it would make a difference, wouldn't you? In many ways it's really not a big performance hit--and keeping the word tags is needed to do meta searching. Otherwise you don't know where the word came from. It turns out that there are so many other speed improvements to htsearch that this is insignificant. Caching, smart sorting, not loading the excerpt until the very end, etc. are all *huge* wins.

>I'm not sure that word-level precision is really necessary either

It is if you want to consider things like proximity scoring. I'll admit that many aspects of the word database design for 3.2 stemmed from what Google published about their database format before they went commercial.

>with the external parser stuff, but I don't know if it works well with
>3.2x.

The current external parser spec has not been updated to keep up with all the possibilities in the database backend. At the moment, this is more of a feature (not having to rewrite parsers/converters) than a bug. I'd worry about parsing speed a bit too.

> * It would be nice to be able to have htsearch run as a server as well as
>a CGI.

Maybe. There are a lot of improvements that could be made to CGI speed before we go that route. Quim has been kind enough to write an example query caching framework for htsearch as well, which would significantly speed up repeat or common queries.
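For anyone wondering why caching helps repeat queries so much, here is a generic sketch of the idea in Python. This is not Quim's framework and not htsearch code. The class name, the capacity, and the normalization rule (lowercasing and collapsing whitespace, which assumes searches are case-insensitive) are all assumptions made for illustration.

```python
from collections import OrderedDict

class QueryCache:
    """Small least-recently-used cache of search results, keyed by query text."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._entries = OrderedDict()

    @staticmethod
    def _key(query):
        # Assumption: queries that differ only in case or spacing are equivalent.
        return " ".join(query.lower().split())

    def get_or_compute(self, query, run_search):
        """Return cached results, or call run_search(query) and remember them."""
        key = self._key(query)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        results = run_search(query)
        self._entries[key] = results
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return results
```

A CGI process exits after each request, so a real version would have to keep the cache somewhere persistent (a file or a small database) rather than in memory. The lookup logic would stay much the same.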
>(IBM's free Unicode code can do both, but would be a bear to integrate)

I think developments in glibc and GNOME in coming months will lead towards an easier solution for this. Fortunately ht://Dig is not the only GPL'ed software that wants to get Unicode done easily.

