It seems to be something of a tradition around this time of year to reflect on the previous year, and talk about the new one. Now I don't hold much to tradition, but because I'm the only one in the office at the moment, now is as good a time as any for reflection.

As a company, S.L.I. Systems makes good use of htdig - we provide custom search solutions, and for sites with a few thousand documents, htdig is fast & cost-effective. The availability of the source means that we can be flexible in our indexing (but it also means that we've hacked the HTML parser somewhat).

Take this wishlist as an extended idle musing, rather than an attack on htdig - as I said, we've got good use from htdig and we're happy to implement changes when we need them rather than complain about missing features. However, I've got some ideas about the directions I'd like to see htdig go (but keep in mind that I'm talking off my own bat here, rather than as the official representative of S.L.I. Systems).
There's a bunch of little things I've noticed, but haven't had the time or inclination to fix, or tell anyone else about:

* There's a feature with the phrase searching where, if the phrase being queried contains a stop-word, it doesn't match even when there's a match in the database. I'm not sure that there is a good fix for this - the only ways I can think of to get around it are indexing absolutely everything, or doing the phrase checking by looking at the stored document texts.

* Handling of meta-tags is a bit lacking - it would be good to be able to surface metadata in the search results, and to be able to search on the contents of metadata fields. (This is on my to-do list for the short term in the new year, although it might be a bit specific to our system. Probably it would be better left out of the main htdig until the new parsing code is active.)

* The searching (in 3.2x at least) seems to unduly favour long documents over short ones, and common words over less common words. Worse, in multi-word queries, the ranking of results is skewed towards the most common term. I've implemented a fix for this in parser.cc, but the code is too horrible to disclose - it counts the number of distinct query terms, and gives those documents containing more than one of the terms a score boost (and more of a boost for more terms). It's not perfect, but it's a good start. I'll see if I can set up a demo sometime so people can play with the two settings.

* The HTTP 1.1 persistent connection code is a little conservative - if it can't find a robots.txt file initially, the web server tends to close the connection. This causes htdig to decide that the server won't do persistent connections. I've poked the code so that it always tries for a persistent connection.

* I've made a change to the noindex settings - I've got noindex2 and noindex3 (and also linksonly, linksonly2 and linksonly3), which are handy for getting links out of menu bars without indexing the text.
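The multi-term boost described in the ranking item above could look roughly like the following. This is only a minimal Python sketch of the idea, not the actual parser.cc change: the function name `rank`, the layout of `postings` (term -> document -> raw score) and the `boost_per_extra_term` weight are all assumptions made up for illustration.

```python
from collections import defaultdict


def rank(query_terms, postings, boost_per_extra_term=0.5):
    """Rank documents, boosting those that match more distinct query terms.

    postings is a hypothetical index: term -> {doc_id: raw_score}.
    boost_per_extra_term is an illustrative weight, not an htdig setting.
    """
    distinct_terms = set(query_terms)   # repeated query words count once
    raw_score = defaultdict(float)      # summed per-term scores per document
    terms_matched = defaultdict(int)    # distinct query terms found per document

    for term in distinct_terms:
        for doc_id, score in postings.get(term, {}).items():
            raw_score[doc_id] += score
            terms_matched[doc_id] += 1

    boosted = {}
    for doc_id, score in raw_score.items():
        extra_terms = terms_matched[doc_id] - 1   # no boost for a single term
        boosted[doc_id] = score * (1.0 + boost_per_extra_term * extra_terms)

    # Highest score first; ties broken by document id for a stable order.
    return sorted(boosted.items(), key=lambda item: (-item[1], item[0]))


# A long document that only mentions the common word would win on raw
# score alone (10.0 vs 7.0); the boost lets the document with both terms
# overtake it.
example_postings = {
    "search": {"long": 10.0, "both": 4.0},
    "htdig": {"both": 3.0, "short": 3.0},
}
example_ranking = rank(["search", "htdig"], example_postings)
```

Setting `boost_per_extra_term` to 0 gives the unboosted ordering back, which is one way the two settings could be compared side by side.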
I had a thought about making some general string list version of these noindex/linksonly settings, but then I decided that it would be too much trouble keeping track of which of the strings was active at any one time, and that 3 options would be enough for anyone (we've only ever used 2 at once). Sometime I'd like to rewrite the HTML parser to be a single-pass beast, but more on that later.

Here's a bunch of things I think about every so often, but have not (so far) done anything about (and with a 7-week-old son in the house I don't think I'll be doing much quickly in my spare time):

* It would be nice to get the latest database code integrated. (I've nearly talked the guys at work around to thinking that this is a good idea for us to do, so I'll see if we can get this done sometime in the new year - currently we're in revenue-gathering mode, but I'll see when we can get around to it.) It would be nice not to have the db get corrupt after 30,000 documents or so.

* I'm not so sure that the new parsing scheme of calculating the score at search time is such a good idea - while it means that there is a great deal of flexibility with the scoring algorithms (and let me fix the multi-word searching without reindexing), it is much slower than the 3.1 series. I'm not sure that word-level precision is really necessary either - phrase searching is probably sufficiently uncommon that it could be achieved by scanning possible result candidates.

* My latest off-the-wall idea is to move to a generic XML (buzzword! but possibly appropriate) intermediate format for indexing, and use an external parser to translate from HTML->digML. This would mean that only a single internal parser would be required, and it would be simple, as all the hard stuff would be done in something like Perl, where it's really easy to muck about with text. Actually, we can do something like this now with the external parser stuff, but I don't know if it works well with 3.2x.
  A general intermediate format would make it easier to index non-web page things, and DB-driven web sites, and stuff like that. The Greenstone Digital Library project (www.nzdl.org) does something similar, and initially I thought it was a mad idea, but it grew on me over time.

* It would be nice to be able to have htsearch run as a server as well as a CGI. We shouldn't remove the CGI option, as I expect that there are a lot of people who can't install servers. However, the option would be nice - commonly accessed bits of the db could be cached (like lexicons & stuff). To get to this place, I suspect that the code would need to be made threadsafe (we've got a lot of experience writing threaded TCP servers so that's the way we'd go if we were writing it - others might do it differently). The easiest way to get there would be to refactor chunks of htdig to use standard C++ components, which have known thread behaviour. (I don't blame htdig for not being written in a modern C++ style, as it has its roots in pre-common-STL days. However, I find working with the STL & std::string much easier than rolling my own.) Fortunately, STLport is available on pretty much all platforms, so I suspect that an STL-based htdig wouldn't lose many current users.

* Better internationalisation support would be wonderful, but a real challenge to implement. I'd really like support for greater than 8-bit character sets, and simultaneous multiple languages would be good too (IBM's free Unicode code can do both, but would be a bear to integrate). I think I'll be waiting for a while on this one.

Anyway, that's my coffee-cup's worth. I think htdig's a pretty nifty piece of software.

Jamie Anstice
Search Scientist, S.L.I. Systems, Inc
[EMAIL PROTECTED]
ph: 64 961 3262
mobile: 64 21 264 9347

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

