I'll preface this by saying that I'm addressing your comments for the 
3.2 code--most things you mention just can't happen with 3.1.x.

-Geoff

At 1:12 PM +1300 1/3/02, Jamie Anstice wrote:
>phrase searching where if the phrase being queried contains a 
>stop-word then it doesn't match even when there's a match in the 
>database.

Yeah, this is a known bug. The fix isn't as bad as you make it out to 
be if you allow for false positive matches. You do the query, keeping 
track of the word offsets in the phrase query itself. So "Foo and Bar 
Esquire" would search for Bar as +2 and Esquire as +3 relative to 
Foo. Some cases will match with some other word replacing the "and," 
but you certainly won't miss any.
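
Here is a rough sketch of the offset idea, written in Python for brevity rather than as a patch against the actual htsearch C++. The stop list, the toy index, and every function name here are made up for illustration. It assumes stop words are left out of the word database but still advance the position counter.

```python
# Hypothetical stop list; the real one comes from the ht://Dig configuration.
STOP_WORDS = {"and", "the", "of", "a"}


def phrase_offsets(phrase):
    """Return (word, offset) pairs for the non-stop words of a phrase.

    Offsets are relative to the first non-stop word.
    """
    words = phrase.lower().split()
    kept = [(i, w) for i, w in enumerate(words) if w not in STOP_WORDS]
    if not kept:
        return []
    base = kept[0][0]
    return [(w, i - base) for i, w in kept]


def build_index(text):
    """Toy word database: word -> set of positions.

    Stop words are not stored, but they still consume a position.
    """
    index = {}
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(pos)
    return index


def phrase_match(index, phrase):
    """True if every non-stop word sits at its expected offset."""
    terms = phrase_offsets(phrase)
    if not terms:
        return False
    first_word = terms[0][0]
    for start in index.get(first_word, ()):
        if all(start + off in index.get(w, ()) for w, off in terms[1:]):
            return True
    return False
```

With this, "Foo and Bar Esquire" matches a document containing "Foo or Bar Esquire" (the false positive mentioned above), but not one containing "Foo Bar Esquire", because Bar is then at +1 instead of +2.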

>  * handling of meta-tags is a bit lacking - it would be good to be able to
>surface metadata in the search results, and to search on the contents of
>meta-data fields.

Sure. We need the new htsearch framework in place before this can 
happen, though. Eventually it would be nice to let users search on 
customized meta tags. The 3.2 database framework allows for this, 
but it is not implemented yet.

I'd be glad to talk to people about how the document parsers would 
need to change to do this.

>  * The searching (in 3.2x at least) seems to unduly favour long documents
>over short ones, and common words over less common words. Worse, in
>multi-word queries, the ranking of results is skewed towards the most
>common term

Yes. Working on these sorts of scoring issues will also have to wait 
until the new htsearch code is in place. As you mention, messing with 
the old parser.cc is horrible.

>  * The HTTP 1.1 persistent connection code is a little conservative - if

Does the server seem to close the connection after any sort of 404?

>thought about making some general string list version, but then I decided
>that it would be too much trouble keeping track of which of the strings
>was active at any one time, and I decided that 3 options would be enough

Sounds like you want a stack to me. If the pattern matches one of the 
items in the string list, it's put on the stack and then removed as 
appropriate.
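
To make the stack idea concrete, here is a small Python sketch. Since the quoted text above is cut off, I am assuming the string list holds start/end marker pairs that bracket regions to skip; the marker strings and the function name are invented for the example.

```python
def strip_marked_regions(tokens, marker_pairs):
    """Drop every token that falls inside a marked region.

    marker_pairs maps a start marker to its end marker (hypothetical
    format). The stack holds the end markers we are still waiting for,
    so the top of the stack is always the pair that is currently active.
    """
    stack = []
    kept = []
    for tok in tokens:
        if stack and tok == stack[-1]:
            # The active region closes: remove it from the stack.
            stack.pop()
        elif tok in marker_pairs:
            # A start marker matched: remember which end marker closes it.
            stack.append(marker_pairs[tok])
        elif not stack:
            # Outside all marked regions: keep the token.
            kept.append(tok)
    return kept
```

The point is that you never have to track which string is "active" by hand: it is whatever sits on top of the stack, and nesting falls out for free.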

>  * It would be nice to get the latest database code integrated

In *theory* this shouldn't be so bad. But I haven't had the time to 
sift through the mifluz code and figure out how the word database 
works now. It sounds like words are indexed by "word ids" rather than 
the words themselves, but that's as far as I got.

>  * I'm not so sure that the new parsing scheme of calculating the score at
>search time is such a good idea ... it is much slower than the 3.1
>series.

You'd think it would make a difference, wouldn't you? In many ways 
it's really not a big performance hit--and keeping the word tags is 
needed to do meta searching. Otherwise you don't know where the word 
came from.

It turns out that there are so many other speed improvements to 
htsearch that this is insignificant. Caching, smart sorting, not 
loading the excerpt until the very end, etc. are all *huge* wins.

>I'm not sure that word-level precision is really necessary either

It is if you want to consider things like proximity scoring. I'll 
admit that many aspects of the word database design for 3.2 stemmed 
from what Google published about their database format before they 
went commercial.
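
As an illustration of why positions matter, here is a minimal proximity score in Python. It is only a sketch under my own assumptions: the 1/gap formula and the function names are invented here and are not what 3.2 actually computes.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position in a and any in b.

    Both lists must be sorted. Returns None if either is empty.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list is behind, as in a merge.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_bonus(positions_a, positions_b):
    """Hypothetical bonus: 1.0 for adjacent words, shrinking with distance."""
    gap = min_distance(sorted(positions_a), sorted(positions_b))
    if gap is None:
        return 0.0
    return 1.0 / max(gap, 1)
```

Without per-word positions in the database there is nothing to feed into a function like this; a document-level word count cannot tell you whether two terms sit next to each other or pages apart.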

>with the external parser stuff, but I don't know if it works well with
>3.2x.

The current external parser spec has not been updated to keep up with 
all the possibilities in the database backend. At the moment, this is 
more of a feature (not having to rewrite parsers/converters) than a 
bug. I'd worry about parsing speed a bit too.

>  * It would be nice to be able to have htsearch run as a server as well as
>a CGI.

Maybe. There are a lot of changes that could improve CGI speed 
before we go that route. Quim has been kind enough to write an 
example query caching framework for htsearch as well, which would 
significantly speed up repeat or common queries.
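
For the general shape of query caching, here is a small Python sketch. To be clear, this is not Quim's framework, which I am not reproducing here; the class, the normalization rule, and the eviction policy are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class QueryCache:
    """Tiny least-recently-used cache in front of a search function."""

    def __init__(self, search_fn, max_entries=128):
        self._search_fn = search_fn      # the expensive search (hypothetical)
        self._max_entries = max_entries
        self._entries = OrderedDict()    # normalized query -> results
        self.hits = 0

    def search(self, query):
        # Assumed normalization: case-fold and collapse whitespace so that
        # trivially different spellings of a query share one cache entry.
        key = " ".join(query.lower().split())
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as recently used
            self.hits += 1
            return self._entries[key]
        result = self._search_fn(key)
        self._entries[key] = result
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        return result
```

A CGI process exits after each request, so a real cache would have to live on disk or in a separate process; the in-memory dictionary here only shows the lookup logic.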

>(IBM's free Unicode code can do both, but would be a bear to integrate)

I think developments in glibc and GNOME in coming months will lead 
towards an easier solution for this. Fortunately ht://Dig is not the 
only GPL'ed software that wants to get Unicode done easily.
