According to Jamie Anstice:

> * there's a feature with the phrase searching where if the phrase being
> queried contains a stop-word then it doesn't match even when there's a
> match in the database. I'm not sure that there is a good fix for this -
> the only way I can think of getting around this is by indexing absolutely
> everything, or by doing the phrase checking by looking at the stored
> document texts.
Well, Quim seems to have confirmed that this does work in the current
snapshots, although the stop words are treated as wildcards right now. In
any case, the query parser is being rewritten, so give it another whirl
after it's done.

> * handling of meta-tags is a bit lacking - it would be good to be able to
> surface metadata in the search results, and to search on the contents of
> meta-data fields. (this is on my to-do list for the short term in the new
> year, although it might be a bit specific to our system. Probably it
> would be better left out of the main htdig until the new parsing code is
> active).

Well, this is certainly on our wish list as well, and has been requested a
number of times. To be generally useful, though, it would need to be
customizable. I.e., you'd need to be able to tell htdig which meta tags to
store, and for each one you'd have to specify:

- into which database field the content of the tag should go, and into
  which template variable, if any;
- which flag value should be used for words in the content of the tag as
  they go into the word database, and, if it's a new/non-standard flag
  value, what weight will be assigned to it at search time;
- and finally, how a user can request that this field alone be searched
  from the search form.

I'm not sure which new parsing code you are referring to above. If you
mean the new query parser, then yes, that would be needed for the last
requirement I mentioned, i.e. to specify how to pick out a particular meta
tag value from the search form. The rest of the requirements would call
for some simple, incremental changes to the HTML parser. We don't have any
immediate plans to rewrite that parser yet, so don't let that hold you
back if you want to customise it. There have been a few requests that we
implement HTML parsing through a generic XML parser, but that's still a
long way away.
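To make the customization idea above concrete, here is a minimal sketch in
Python (not htdig code). Everything here is hypothetical: the config keys,
flag names, and weights are made-up placeholders for the per-tag settings
listed above, not real htdig attributes.

```python
from html.parser import HTMLParser

# Hypothetical per-tag configuration. None of these names exist in htdig;
# they just illustrate the four things each meta tag would need specified:
# target db field, template variable, word-db flag value, and its weight.
META_CONFIG = {
    "author":      {"db_field": "author", "template_var": "AUTHOR",
                    "flag": "FLAG_AUTHOR", "weight": 5},
    "description": {"db_field": "descr", "template_var": "DESCR",
                    "flag": "FLAG_DESCR", "weight": 3},
}

class MetaExtractor(HTMLParser):
    """Collect configured meta-tag contents while parsing a document."""
    def __init__(self):
        super().__init__()
        self.fields = {}   # db_field -> stored content, for search results
        self.words = []    # (word, flag, weight) tuples for the word db

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        cfg = META_CONFIG.get(a.get("name", "").lower())
        if cfg and "content" in a:
            self.fields[cfg["db_field"]] = a["content"]
            for w in a["content"].lower().split():
                self.words.append((w, cfg["flag"], cfg["weight"]))

p = MetaExtractor()
p.feed('<html><head>'
       '<meta name="Author" content="Jamie Anstice">'
       '</head></html>')
```

The point of the sketch is just that all four decisions are data, not
code, so adding support for a new meta tag would be a config change only.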
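Going back to the stop-word point at the top: the current-snapshot
behaviour of treating stop words in a phrase as wildcards can be sketched
as follows. This is a toy reconstruction from the description above, not
the actual htsearch phrase-matching code.

```python
def phrase_match(doc_words, phrase, stop_words):
    """Return True if `phrase` occurs in `doc_words`, with stop words in
    the phrase treated as wildcards that match any single word."""
    pattern = [None if w in stop_words else w for w in phrase]
    n, m = len(doc_words), len(pattern)
    for i in range(n - m + 1):
        # None entries (stop words) match whatever word is in that slot.
        if all(p is None or doc_words[i + j] == p
               for j, p in enumerate(pattern)):
            return True
    return False
```

Note the trade-off this implies: a wildcard slot matches *any* word, so
"jump over the moon" would also match "jump over a moon" - looser than
true phrase matching, but it avoids indexing the stop words themselves.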
(Whatever we end up implementing, it will be absolutely mandatory that it
correctly handles legacy HTML documents, not just properly conforming
XHTML.)

> * The searching (in 3.2x at least) seems to unduly favour long documents
> over short ones, and common words over less common words. Worse, in
> multi-word queries, the ranking of results is skewed towards the most
> common term (I've implemented a fix for this in parser.cc, but the code is
> too horrible to disclose - it counts the number of distinct query terms,
> and gives those documents containing more than one of the terms a score
> boost (and more of a boost for more terms). It's not perfect, but it's a
> good start. I'll see if I can set up a demo sometime so people can play
> with the two settings).

You might want to have a look at the multimatch_factor attribute handling
code I've added to the 3.1.6 snapshots. It's a bit of an ugly hack too,
but does seem to be somewhat useful. If you can help Geoff and Quim get
better rankings in the new query parser, that would be wonderful.

> * The HTTP 1.1 persistent connection code is a little conservative - if
> it can't find a robots.txt file initially the web server tends to close
> the connection. This causes htdig to decide that the server won't do
> persistent connections. I've poked the code so that it always tries for a
> persistent connection.

That sounds like a bug to me. Gabriele, are you following this thread? If
so, could you look into this problem, please? Geoff asked whether the
server seems to close the connection after any sort of 404. I'd add to
that: does closing the connection always disable subsequent persistent
connections, or only after an error?

> * I've made a change to the noindex settings - I've got noindex2 and
> noindex3 (and also linksonly, linksonly2 and linksonly3) which are handy
> for getting links out of menu bars without indexing the text).
> I had a thought about making some general string list version, but then
> I decided that it would be too much trouble keeping track of which of
> the strings was active at any one time, and I decided that 3 options
> would be enough for anyone (we've only ever used 2 at once). Sometime
> I'd like to rewrite the HTML parser to be a single-pass beast, but more
> on that later.

It's not clear to me why you'd need to use 2 different tags for turning
off indexing, let alone 3. As I said yesterday, the latest snapshots
support <noindex follow>, which would be like your <linksonly> tag, and
they correctly handle other tags so that, for example, a </script> tag
doesn't turn indexing back on if it was turned off by an earlier meta
robots tag. Can you provide an example of where you'd need 2 different
tags?

> Here's a bunch of things I think about every so often, but not (so far)
> done anything about (and with a 7-week-old son in the house I don't
> think I'll be doing much quickly in my spare time).

Hey, congrats! Don't worry, they do get less demanding as they get older.

> * It would be nice to get the latest database code integrated (I've
> nearly talked the guys at work around to thinking that this is a good
> idea for us to do, so I'll see if we can get this done sometime in the
> new year - currently we're in revenue-gathering mode, but I'll see when
> we can get around to it). It would be nice not to have the db get
> corrupt after 30,000 documents or so.

This is certainly already high up on our list of priorities for 3.2.0b4.
I believe the latest mifluz library includes the latest Berkeley DB code
plus its word list compression code.
> * I'm not so sure that the new parsing scheme of calculating the score
> at search time is such a good idea - while it means that there is a
> great deal of flexibility with the scoring algorithms (and let me fix
> the multi-word searching without reindexing) it is much slower than the
> 3.1 series. I'm not sure that word-level precision is really necessary
> either - phrase searching is probably sufficiently uncommon that it
> could be achieved by scanning possible result candidates.

Are you saying that 3.2 is slower at search time, or overall, including
indexing time? True, indexing is much slower in 3.2, but in some cases
searching in 3.2 is faster than in 3.1, because the changes in database
layouts often more than make up for the time it takes to search a bigger
word database. 3.1's htsearch was awfully slow when it had a lot of
matches and had to look up db.docs.index and db.docdb records for all of
them.

> * My latest off-the-wall idea is to move to a generic XML (buzzword!
> but possibly appropriate) intermediate format for indexing, and use an
> external parser to translate from HTML->digML. This would mean that
> only a single internal parser would be required, and it would be
> simple, as all the hard stuff would be done in something like Perl,
> where it's really easy to muck about with text - actually we can do
> something like this now with the external parser stuff, but I don't
> know if it works well with 3.2x. A general intermediate format would
> make it easier to index non-web page things, and DB-driven web sites,
> and stuff like that. The Greenstone Digital Library project
> (www.nzdl.org) does something similar, and initially I thought it was a
> mad idea, but it grew on me over time.

I'd be worried about the performance hit htdig would take as a result of
this. I think for most users, HTML documents make up the vast majority of
what they index, so we want this parser to be fast. Requiring an external
converter for HTML would really slow down indexing. Having a single
internal parser is certainly a good idea, but I think that parser should
be able to handle all current flavours of HTML natively.

> * It would be nice to be able to have htsearch run as a server as well
> as a CGI.
> We shouldn't remove the CGI option, as I expect that there are a

As Geoff said, there are lots of other ways to improve the speed.
However, if you want to implement this we'd certainly consider your mods
for the next release.

> * better internationalisation support would be wonderful, but a real
> challenge to implement. I'd really like support for greater than 8-bit
> character sets, and simultaneous multiple languages would be good too
> (IBM's free Unicode code can do both, but would be a bear to integrate).
> I think I'll be waiting for a while on this one.

This is certainly on our wish list as well. Hopefully the glibc changes
Geoff talked about will help.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

