Hi Jamie,

All of your suggestions are very valid.
My understanding is that a typical software engineer makes about $45/hr.
Naturally, you've set up a fund to reimburse the ht://Dig authors as they
leave their permanent jobs and work full-time on ht://Dig to implement the
improvements and fixes you've suggested? : )

Dave.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Jamie Anstice
> Sent: Wednesday, January 02, 2002 7:12 PM
> To: [EMAIL PROTECTED]
> Subject: [htdig] new year's thoughts
>
>
> It seems to be something of a tradition around this time of year to
> reflect on the previous year, and talk about the new one. Now I don't
> hold much to tradition, but because I'm the only one in the office at
> the moment, now is as good a time as any for reflection. As a company,
> S.L.I. Systems makes good use of htdig - we provide custom search
> solutions, and for sites with a few thousand documents, htdig is fast
> & cost-effective. The availability of the source means that we can be
> flexible in our indexing (but it means that we've hacked the HTML
> parser somewhat). Take this wishlist as an extended idle musing, rather
> than an attack on htdig - as I said, we've got good use from htdig and
> we're happy to implement changes when we need them rather than complain
> about missing features. However, I've got some ideas about the
> directions I'd like to see htdig go (but keep in mind that I'm talking
> off my own bat here, rather than as the official representative of
> S.L.I. Systems).
>
> There's a bunch of little things I've noticed, but haven't had the time
> or inclination to fix, or tell anyone else about:
> * there's a feature with the phrase searching where if the phrase being
>   queried contains a stop-word then it doesn't match even when there's
>   a match in the database.
> I'm not sure that there is a good fix for this - the only way I can
> think of getting around this is by indexing absolutely everything, or
> by doing the phrase checking by looking at the stored document texts.
> * handling of meta-tags is a bit lacking - it would be good to be able
>   to surface metadata in the search results, and to search on the
>   contents of meta-data fields. (this is on my to-do list for the
>   short term in the new year, although it might be a bit specific to
>   our system. Probably it would be better left out of the main htdig
>   until the new parsing code is active).
> * The searching (in 3.2x at least) seems to unduly favour long
>   documents over short ones, and common words over less common words.
>   Worse, in multi-word queries, the ranking of results is skewed
>   towards the most common term (I've implemented a fix for this in
>   parser.cc, but the code is too horrible to disclose - it counts the
>   number of distinct query terms, and gives those documents containing
>   more than one of the terms a score boost (and more of a boost for
>   more terms). It's not perfect, but it's a good start. I'll see if I
>   can set up a demo sometime so people can play with the two
>   settings).
> * The HTTP 1.1 persistent connection code is a little conservative -
>   if it can't find a robots.txt file initially, the web server tends
>   to close the connection. This causes htdig to decide that the server
>   won't do persistent connections. I've poked the code so that it
>   always tries for a persistent connection.
> * I've made a change to the noindex settings - I've got noindex2 and
>   noindex3 (and also linksonly, linksonly2 and linksonly3), which are
>   handy for getting links out of menu bars without indexing the text.
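For readers curious what the distinct-term boost described above might look like, here is a minimal sketch. All names and the boost factor are hypothetical illustrations, not the actual parser.cc change Jamie made:

```cpp
#include <set>
#include <string>
#include <vector>

// Count how many *distinct* query terms appear in a document, then boost
// the raw score accordingly: a document matching several of the query's
// terms ranks ahead of one that merely repeats a single common term.
double boosted_score(double raw_score,
                     const std::vector<std::string>& query_terms,
                     const std::set<std::string>& doc_terms) {
    std::set<std::string> matched;
    for (const auto& term : query_terms)
        if (doc_terms.count(term))
            matched.insert(term);
    // Hypothetical boost: scale by the number of distinct terms matched.
    if (matched.size() > 1)
        return raw_score * static_cast<double>(matched.size());
    return raw_score;
}
```

The key point is that the boost depends on how many *different* terms match, so the skew toward the single most common term is counteracted without reweighting individual term frequencies.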
> I had a thought about making some general string list version, but
> then I decided that it would be too much trouble keeping track of
> which of the strings was active at any one time, and I decided that 3
> options would be enough for anyone (we've only ever used 2 at once).
> Sometime I'd like to rewrite the HTML parser to be a single-pass
> beast, but more on that later.
>
> Here's a bunch of things I think about every so often, but haven't (so
> far) done anything about (and with a 7-week-old son in the house I
> don't think I'll be doing much quickly in my spare time).
>
> * It would be nice to get the latest database code integrated (I've
>   nearly talked the guys at work around to thinking that this is a
>   good idea for us to do, so I'll see if we can get this done sometime
>   in the new year - currently we're in revenue-gathering mode, but
>   I'll see when we can get around to it). It would be nice not to have
>   the db get corrupt after 30,000 documents or so.
> * I'm not so sure that the new parsing scheme of calculating the score
>   at search time is such a good idea - while it means that there is a
>   great deal of flexibility with the scoring algorithms (and it let me
>   fix the multi-word searching without reindexing), it is much slower
>   than the 3.1 series. I'm not sure that word-level precision is
>   really necessary either - phrase searching is probably sufficiently
>   uncommon that it could be achieved by scanning possible result
>   candidates.
> * My latest off-the-wall idea is to move to a generic XML (buzzword!
>   but possibly appropriate) intermediate format for indexing, and use
>   an external parser to translate from HTML->digML.
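The "scan the candidates" alternative to word-level precision mentioned above could look roughly like this - run only on documents that already matched all of the phrase's individual words. This is a hedged sketch with a naive whitespace tokenizer, not htdig's actual word-splitting rules:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace into word tokens (sketch only; a real
// implementation would mirror the indexer's tokenization exactly).
static std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Return true if the phrase's words occur consecutively in the stored
// document text.  Because phrase queries are comparatively rare, paying
// a linear scan per candidate may beat storing word positions for every
// term in the index.
bool phrase_in_document(const std::string& doc_text,
                        const std::vector<std::string>& phrase) {
    const std::vector<std::string> words = tokenize(doc_text);
    if (phrase.empty() || words.size() < phrase.size()) return false;
    for (size_t i = 0; i + phrase.size() <= words.size(); ++i) {
        size_t j = 0;
        while (j < phrase.size() && words[i + j] == phrase[j]) ++j;
        if (j == phrase.size()) return true;
    }
    return false;
}
```

The trade-off is index size and indexing speed against query-time cost on the (presumably small) candidate set.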
> This would mean that only a single internal parser would be required,
> and it would be simple, as all the hard stuff would be done in
> something like Perl, where it's really easy to muck about with text -
> actually we can do something like this now with the external parser
> stuff, but I don't know if it works well with 3.2x. A general
> intermediate format would make it easier to index non-web-page things,
> and DB-driven web sites, and stuff like that. The Greenstone Digital
> Library project (www.nzdl.org) does something similar, and initially I
> thought it was a mad idea, but it grew on me over time.
> * It would be nice to be able to have htsearch run as a server as well
>   as a CGI. We shouldn't remove the CGI option, as I expect that there
>   are a lot of people who can't install servers. However, the option
>   would be nice - commonly accessed bits of the db could be cached
>   (like lexicons & stuff). To get to this place, I suspect that the
>   code would need to be made threadsafe (we've got a lot of experience
>   writing threaded TCP servers, so that's the way we'd go if we were
>   writing it - others might do it differently). The easiest way to get
>   there would be to refactor chunks of htdig with standard C++ which
>   has known thread behaviour (I don't blame htdig for not being
>   written in a modern C++ style, as it has its roots in pre-common-STL
>   days. However, I find working with the STL & std::string much easier
>   than rolling my own). Fortunately, STLport is available on pretty
>   much all platforms, so I suspect that an STL-based htdig wouldn't
>   lose many current users.
> * better internationalisation support would be wonderful, but a real
>   challenge to implement. I'd really like support for greater-than-8-bit
>   character sets, and simultaneous multiple languages would be good
>   too (IBM's free Unicode code can do both, but would be a bear to
>   integrate). I think I'll be waiting for a while on this one.
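The caching idea for a long-lived htsearch server ("commonly accessed bits of the db could be cached") can be sketched with standard C++ primitives. Everything here is hypothetical - `LexiconCache`, the loader signature, and the frequency values stand in for whatever htdig's db layer actually provides. A real server would also want eviction and finer-grained locking:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// A minimal thread-safe cache of lexicon lookups for a persistent
// htsearch server: one mutex guarding a map.  A CGI htsearch rebuilds
// this state on every request; a server process keeps it warm.
class LexiconCache {
public:
    // Look up a word's document frequency; on a miss, compute it via the
    // supplied loader (standing in for a db read) and remember it.
    int frequency(const std::string& word, int (*load)(const std::string&)) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(word);
        if (it != cache_.end()) return it->second;
        int freq = load(word);
        cache_[word] = freq;
        return freq;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, int> cache_;
};
```

This is also a small example of the refactoring direction described above: `std::mutex` and the containers have specified thread behaviour, which hand-rolled string and list classes from pre-STL days do not.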
>
> Anyway, that's my coffee-cup's worth. I think htDig's a pretty nifty
> piece of software.
>
>
> Jamie Anstice
> Search Scientist, S.L.I. Systems, Inc
> [EMAIL PROTECTED]
> ph: 64 961 3262
> mobile: 64 21 264 9347
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to
> <[EMAIL PROTECTED]> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html

