Hi Jamie,

All of your suggestions are very valid.
My understanding is that a typical software engineer makes about $45/hr.
Naturally, you've set up a fund to reimburse the ht://Dig authors as they
leave their permanent jobs and work full-time on ht://Dig to implement the
improvements and fixes you've suggested? : )

Dave.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Jamie Anstice
> Sent: Wednesday, January 02, 2002 7:12 PM
> To: [EMAIL PROTECTED]
> Subject: [htdig] new year's thoughts
>
>
> It seems to be something of a tradition around this time of year to
> reflect on the previous year, and talk about the new one. Now I don't
> hold much to tradition, but because I'm the only one in the office at
> the moment, now is as good a time as any for reflection. As a company,
> S.L.I. Systems makes good use of htdig - we provide custom search
> solutions, and for sites with a few thousand documents, htdig is fast
> & cost-effective. The availability of the source means that we can be
> flexible in our indexing (but it means that we've hacked the HTML
> parser somewhat). Take this wishlist as an extended idle musing, rather
> than an attack on htdig - as I said, we've got good use from htdig and
> we're happy to implement changes when we need them rather than complain
> about missing features. However, I've got some ideas about the
> directions I'd like to see htdig go (but keep in mind that I'm talking
> off my own bat here, rather than as the official representative of
> S.L.I. Systems).
>
> There's a bunch of little things I've noticed, but haven't had the time
> or inclination to fix, or tell anyone else about:
> * there's a feature with the phrase searching where if the phrase being
>   queried contains a stop-word then it doesn't match even when there's
>   a match in the database.
> I'm not sure that there is a good fix for this - the only way I can
> think of getting around this is by indexing absolutely everything, or
> by doing the phrase checking by looking at the stored document texts.
> * handling of meta-tags is a bit lacking - it would be good to be able
>   to surface metadata in the search results, and to search on the
>   contents of meta-data fields. (this is on my to-do list for the
>   short term in the new year, although it might be a bit specific to
>   our system. Probably it would be better left out of the main htdig
>   until the new parsing code is active).
> * The searching (in 3.2x at least) seems to unduly favour long
>   documents over short ones, and common words over less common words.
>   Worse, in multi-word queries, the ranking of results is skewed
>   towards the most common term (I've implemented a fix for this in
>   parser.cc, but the code is too horrible to disclose - it counts the
>   number of distinct query terms, and gives those documents containing
>   more than one of the terms a score boost (and more of a boost for
>   more terms). It's not perfect, but it's a good start. I'll see if I
>   can set up a demo sometime so people can play with the two
>   settings).
> * The HTTP 1.1 persistent connection code is a little conservative -
>   if it can't find a robots.txt file initially, the web server tends
>   to close the connection. This causes htdig to decide that the server
>   won't do persistent connections. I've poked the code so that it
>   always tries for a persistent connection.
> * I've made a change to the noindex settings - I've got noindex2 and
>   noindex3 (and also linksonly, linksonly2 and linksonly3), which are
>   handy for getting links out of menu bars without indexing the text.
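For readers curious what the distinct-term boost described above might look like, here is a minimal sketch. All names and the boost factor are hypothetical illustrations, not the actual parser.cc change Jamie made:

```cpp
#include <set>
#include <string>
#include <vector>

// Count how many *distinct* query terms appear in a document, then boost
// the raw score accordingly: a document matching several of the query's
// terms ranks ahead of one that merely repeats a single common term.
double boosted_score(double raw_score,
                     const std::vector<std::string>& query_terms,
                     const std::set<std::string>& doc_terms) {
    std::set<std::string> matched;
    for (const auto& term : query_terms)
        if (doc_terms.count(term))
            matched.insert(term);
    // Hypothetical boost: scale by the number of distinct terms matched.
    if (matched.size() > 1)
        return raw_score * static_cast<double>(matched.size());
    return raw_score;
}
```

The key point is that the boost depends on how many *different* terms match, so the skew toward the single most common term is counteracted without reweighting individual term frequencies.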
> I had a thought about making some general string list version, but
> then I decided that it would be too much trouble keeping track of
> which of the strings was active at any one time, and I decided that 3
> options would be enough for anyone (we've only ever used 2 at once).
> Sometime I'd like to rewrite the HTML parser to be a single-pass
> beast, but more on that later.
>
> Here's a bunch of things I think about every so often, but haven't (so
> far) done anything about (and with a 7-week-old son in the house I
> don't think I'll be doing much quickly in my spare time).
>
> * It would be nice to get the latest database code integrated (I've
>   nearly talked the guys at work around to thinking that this is a
>   good idea for us to do, so I'll see if we can get this done sometime
>   in the new year - currently we're in revenue-gathering mode, but
>   I'll see when we can get around to it). It would be nice not to have
>   the db get corrupt after 30,000 documents or so.
> * I'm not so sure that the new parsing scheme of calculating the score
>   at search time is such a good idea - while it means that there is a
>   great deal of flexibility with the scoring algorithms (and it let me
>   fix the multi-word searching without reindexing), it is much slower
>   than the 3.1 series. I'm not sure that word-level precision is
>   really necessary either - phrase searching is probably sufficiently
>   uncommon that it could be achieved by scanning possible result
>   candidates.
> * My latest off-the-wall idea is to move to a generic XML (buzzword!
>   but possibly appropriate) intermediate format for indexing, and use
>   an external parser to translate from HTML->digML.
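The "scan the candidates" alternative to word-level precision mentioned above could look roughly like this - run only on documents that already matched all of the phrase's individual words. This is a hedged sketch with a naive whitespace tokenizer, not htdig's actual word-splitting rules:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace into word tokens (sketch only; a real
// implementation would mirror the indexer's tokenization exactly).
static std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Return true if the phrase's words occur consecutively in the stored
// document text.  Because phrase queries are comparatively rare, paying
// a linear scan per candidate may beat storing word positions for every
// term in the index.
bool phrase_in_document(const std::string& doc_text,
                        const std::vector<std::string>& phrase) {
    const std::vector<std::string> words = tokenize(doc_text);
    if (phrase.empty() || words.size() < phrase.size()) return false;
    for (size_t i = 0; i + phrase.size() <= words.size(); ++i) {
        size_t j = 0;
        while (j < phrase.size() && words[i + j] == phrase[j]) ++j;
        if (j == phrase.size()) return true;
    }
    return false;
}
```

The trade-off is index size and indexing speed against query-time cost on the (presumably small) candidate set.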
> This would mean that only a single internal parser would be required,
> and it would be simple, as all the hard stuff would be done in
> something like Perl, where it's really easy to muck about with text -
> actually we can do something like this now with the external parser
> stuff, but I don't know if it works well with 3.2x. A general
> intermediate format would make it easier to index non-web-page things,
> and DB-driven web sites, and stuff like that. The Greenstone Digital
> Library project (www.nzdl.org) does something similar, and initially I
> thought it was a mad idea, but it grew on me over time.
> * It would be nice to be able to have htsearch run as a server as well
>   as a CGI. We shouldn't remove the CGI option, as I expect that there
>   are a lot of people who can't install servers. However, the option
>   would be nice - commonly accessed bits of the db could be cached
>   (like lexicons & stuff). To get to this place, I suspect that the
>   code would need to be made threadsafe (we've got a lot of experience
>   writing threaded TCP servers, so that's the way we'd go if we were
>   writing it - others might do it differently). The easiest way to get
>   there would be to refactor chunks of htdig with standard C++ which
>   has known thread behaviour (I don't blame htdig for not being
>   written in a modern C++ style, as it has its roots in pre-common-STL
>   days. However, I find working with the STL & std::string much easier
>   than rolling my own). Fortunately, STLport is available on pretty
>   much all platforms, so I suspect that an STL-based htdig wouldn't
>   lose many current users.
> * better internationalisation support would be wonderful, but a real
>   challenge to implement. I'd really like support for greater-than-8-bit
>   character sets, and simultaneous multiple languages would be good
>   too (IBM's free Unicode code can do both, but would be a bear to
>   integrate). I think I'll be waiting for a while on this one.
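The caching idea for a long-lived htsearch server ("commonly accessed bits of the db could be cached") can be sketched with standard C++ primitives. Everything here is hypothetical - `LexiconCache`, the loader signature, and the frequency values stand in for whatever htdig's db layer actually provides. A real server would also want eviction and finer-grained locking:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// A minimal thread-safe cache of lexicon lookups for a persistent
// htsearch server: one mutex guarding a map.  A CGI htsearch rebuilds
// this state on every request; a server process keeps it warm.
class LexiconCache {
public:
    // Look up a word's document frequency; on a miss, compute it via the
    // supplied loader (standing in for a db read) and remember it.
    int frequency(const std::string& word, int (*load)(const std::string&)) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(word);
        if (it != cache_.end()) return it->second;
        int freq = load(word);
        cache_[word] = freq;
        return freq;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, int> cache_;
};
```

This is also a small example of the refactoring direction described above: `std::mutex` and the containers have specified thread behaviour, which hand-rolled string and list classes from pre-STL days do not.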
>
> Anyway, that's my coffee-cup's worth. I think htDig's a pretty nifty
> piece of software.
>
>
> Jamie Anstice
> Search Scientist, S.L.I. Systems, Inc
> [EMAIL PROTECTED]
> ph: 64 961 3262
> mobile: 64 21 264 9347
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to
> <[EMAIL PROTECTED]> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html

