It seems to be something of a tradition around this time of year to reflect 
on the previous year, and talk about the new one.  Now I don't hold much 
to tradition, but because I'm the only one in the office at the moment, 
now is as good a time as any for reflection.  As a company, S.L.I. Systems 
makes good use of htdig - we provide custom search solutions, and for 
sites with a few thousand documents, htdig is fast & cost-effective.  The 
availability of the source means that we can be flexible in our indexing 
(but it means that we've hacked the HTML parser somewhat).  Take this 
wishlist as an extended idle musing, rather than an attack on htdig - as I 
said, we've got good use from htdig and we're happy to implement changes 
when we need them rather than complain about missing features.  However, 
I've got some ideas about the directions I'd like to see htdig go (but 
keep in mind that I'm talking off my own bat here, rather than as an 
official representative of S.L.I. Systems).

There's a bunch of little things I've noticed, but haven't had the time or 
inclination to fix, or tell anyone else about:
 * there's a bug in the phrase searching: if the phrase being queried 
contains a stop word, it doesn't match even when there's a match in the 
database.  I'm not sure there is a good fix for this - the only ways I 
can think of getting around it are to index absolutely everything, or to 
do the phrase checking by looking at the stored document texts.
 * handling of meta tags is a bit lacking - it would be good to be able 
to surface metadata in the search results, and to search on the contents 
of metadata fields.  (This is on my to-do list for the short term in the 
new year, although it might be a bit specific to our system.  It would 
probably be better left out of the main htdig until the new parsing code 
is active.)
 * The searching (in 3.2x at least) seems to unduly favour long documents 
over short ones, and common words over less common words.  Worse, in 
multi-word queries the ranking of results is skewed towards the most 
common term.  I've implemented a fix for this in parser.cc, but the code 
is too horrible to disclose: it counts the number of distinct query 
terms, and gives documents containing more than one of the terms a score 
boost (and more of a boost for more terms).  It's not perfect, but it's a 
good start.  I'll see if I can set up a demo sometime so people can play 
with the two settings.
 * The HTTP/1.1 persistent connection code is a little conservative - if 
htdig can't find a robots.txt file initially, the web server tends to 
close the connection, which causes htdig to decide that the server won't 
do persistent connections.  I've poked the code so that it always tries 
for a persistent connection.
 * I've made a change to the noindex settings - I've got noindex2 and 
noindex3 (and also linksonly, linksonly2 and linksonly3), which are handy 
for getting links out of menu bars without indexing the text.  I had a 
thought about making some general string-list version, but then I decided 
that it would be too much trouble keeping track of which of the strings 
was active at any one time, and that 3 options would be enough for anyone 
(we've only ever used 2 at once).  Sometime I'd like to rewrite the HTML 
parser to be a single-pass beast, but more on that later.
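On the stop-word problem in phrase searching above: the second workaround 
I mentioned - checking candidate phrases against the stored document text 
- might look something like this.  A minimal sketch only; the tokenizer 
and the treatment of stop words as wildcards are my assumptions, not 
htdig code.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: lower-case and strip punctuation from each
// whitespace-separated word, so phrase comparison is literal.
static std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string w;
    while (in >> w) {
        std::string clean;
        for (char c : w)
            if (std::isalnum(static_cast<unsigned char>(c)))
                clean += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(c)));
        if (!clean.empty())
            words.push_back(clean);
    }
    return words;
}

// Check a phrase against the stored document text.  A stop word in the
// query phrase (which was never indexed) matches any document word.
bool phrase_matches(const std::string &doc, const std::string &phrase,
                    const std::set<std::string> &stop_words) {
    std::vector<std::string> d = tokenize(doc);
    std::vector<std::string> p = tokenize(phrase);
    if (p.empty())
        return false;
    for (std::size_t i = 0; i + p.size() <= d.size(); ++i) {
        bool ok = true;
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (stop_words.count(p[j]))
                continue;  // stop word: matches whatever is there
            if (d[i + j] != p[j]) { ok = false; break; }
        }
        if (ok)
            return true;
    }
    return false;
}
```

The cost is a linear scan of each candidate document's text, which is why 
I'd only run it over the result candidates rather than the whole db.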
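The distinct-term boost I described for multi-word queries has roughly 
this shape - a hypothetical sketch, not the actual parser.cc patch, and 
the doubling-per-extra-term factor is an illustrative assumption rather 
than anything I've tuned.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical sketch of the distinct-term boost: count how many
// distinct query terms a document matched, and scale the base score up
// for each term beyond the first, so documents matching several query
// terms outrank documents stuffed with one common term.
double boosted_score(double base_score,
                     const std::set<std::string> &matched_terms) {
    if (matched_terms.size() <= 1)
        return base_score;
    double factor = 1.0;
    for (std::size_t i = 1; i < matched_terms.size(); ++i)
        factor *= 2.0;  // illustrative constant, not htdig's
    return base_score * factor;
}
```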

Here's a bunch of things I think about every so often, but haven't (so 
far) done anything about (and with a 7-week-old son in the house I don't 
think I'll be doing much quickly in my spare time).

 * It would be nice to get the latest database code integrated.  I've 
nearly talked the guys at work around to thinking this is a good idea for 
us, so I'll see if we can get it done sometime in the new year (currently 
we're in revenue-gathering mode, but we'll get around to it when we can). 
It would be nice not to have the db get corrupt after 30,000 documents or 
so.
 * I'm not so sure that the new parsing scheme of calculating the score 
at search time is such a good idea - while it allows a great deal of 
flexibility in the scoring algorithms (and let me fix the multi-word 
searching without reindexing), it is much slower than the 3.1 series.  
I'm not sure that word-level precision is really necessary either - 
phrase searching is probably sufficiently uncommon that it could be 
achieved by scanning possible result candidates.
 * My latest off-the-wall idea is to move to a generic XML (buzzword! but 
possibly appropriate) intermediate format for indexing, and use an 
external parser to translate from HTML to digML.  This would mean that 
only a single internal parser would be required, and it would be simple, 
as all the hard stuff would be done in something like Perl, where it's 
really easy to muck about with text.  Actually, we can do something like 
this now with the external parser stuff, but I don't know if it works 
well with 3.2x.  A general intermediate format would make it easier to 
index non-web-page things, DB-driven web sites, and so on.  The 
Greenstone Digital Library project (www.nzdl.org) does something similar; 
initially I thought it was a mad idea, but it grew on me over time.
 * It would be nice to be able to run htsearch as a server as well as a 
CGI.  We shouldn't remove the CGI option, as I expect there are a lot of 
people who can't install servers.  However, the option would be nice - 
commonly accessed bits of the db could be cached (like lexicons & such). 
To get there, I suspect the code would need to be made thread-safe (we've 
got a lot of experience writing threaded TCP servers, so that's the way 
we'd go if we were writing it - others might do it differently).  The 
easiest way would be to refactor chunks of htdig into standard C++ with 
known thread behaviour.  (I don't blame htdig for not being written in a 
modern C++ style, as it has its roots in pre-common-STL days; however, I 
find working with the STL & std::string much easier than rolling my own.) 
Fortunately, STLport is available on pretty much all platforms, so I 
suspect that an STL-based htdig wouldn't lose many current users.
 * better internationalisation support would be wonderful, but a real 
challenge to implement.  I'd really like support for greater than 8-bit 
character sets, and simultaneous multiple languages would be good too 
(IBM's free Unicode code can do both, but would be a bear to integrate). I 
think I'll be waiting for a while on this one.
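To make the digML idea concrete, emitting such an intermediate record 
might look like the sketch below.  The format, the tag names, and "digML" 
itself are all made up for illustration - the point is just that any 
external parser (Perl, a DB exporter, whatever) reduces its source to one 
simple format that htdig alone has to understand.

```cpp
#include <string>

// Escape the characters that would break the intermediate markup.
static std::string xml_escape(const std::string &s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Wrap a document in a hypothetical digML record.  The tag set here is
// invented; a real format would also carry metadata fields, headings, &c.
std::string to_digml(const std::string &url, const std::string &title,
                     const std::string &body) {
    return "<document>\n"
           "  <url>" + xml_escape(url) + "</url>\n"
           "  <title>" + xml_escape(title) + "</title>\n"
           "  <body>" + xml_escape(body) + "</body>\n"
           "</document>\n";
}
```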
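The cached-lexicon idea from the htsearch-as-a-server bullet might be 
sketched like this - the slow lookup is a stand-in for hitting the db 
files, not a real htdig call, and std::mutex assumes a rather more modern 
compiler than the ones htdig currently targets.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical sketch for a persistent htsearch server: a mutex-protected
// cache in front of a slow db lookup, so commonly queried words are
// served from memory after the first request.
class LexiconCache {
public:
    explicit LexiconCache(std::map<std::string, int> db)
        : db_(std::move(db)) {}

    int frequency(const std::string &word) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto hit = cache_.find(word);
        if (hit != cache_.end())
            return hit->second;        // served from memory
        int freq = slow_lookup(word);  // would hit the db files
        cache_[word] = freq;
        return freq;
    }

private:
    int slow_lookup(const std::string &word) const {
        auto it = db_.find(word);
        return it == db_.end() ? 0 : it->second;
    }
    std::map<std::string, int> db_;     // stand-in for the word db
    std::map<std::string, int> cache_;  // in-memory hot set
    std::mutex mutex_;                  // one lock; fine for a sketch
};
```

A single lock over the whole cache is the crudest possible scheme; a real 
threaded server would want something finer-grained.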

Anyway, that's my coffee-cup's worth.  I think htDig's a pretty nifty 
piece of software.


Jamie Anstice
Search Scientist,  S.L.I. Systems, Inc
[EMAIL PROTECTED]
ph:  64 961 3262
mobile: 64 21 264 9347

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
