According to Jamie Anstice:
>  * there's a feature with the phrase searching where if the phrase being 
> queried contains a stop-word then it doesn't match even when there's a 
> match in the database.  I'm not sure that there is a good fix for this - 
> the only way I can think of getting around this is by indexing absolutely 
> everything, or by doing the phrase checking by looking at the stored 
> document texts.

Well, Quim seems to have confirmed that this does work in the current
snapshots, although the stop words are treated as wildcards right now.
In any case, the query parser is being rewritten, so give it another whirl
after it's done.
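The wildcard treatment can be sketched roughly like this (a simplified illustration, not htdig's actual matching code; the stop-word list is hypothetical): a stop word in the query phrase matches any single word at that position in the document.

```python
# Hypothetical stop-word list -- illustrative only.
STOP_WORDS = {"the", "a", "an", "in", "of"}

def phrase_matches(phrase, doc_words):
    """True if `phrase` occurs in `doc_words`; stop words in the
    phrase act as single-word wildcards at their position."""
    terms = phrase.lower().split()
    n = len(terms)
    for i in range(len(doc_words) - n + 1):
        window = doc_words[i:i + n]
        if all(t in STOP_WORDS or t == w.lower()
               for t, w in zip(terms, window)):
            return True
    return False
```

So "cat in the hat" would match a document containing "cat in a hat", since the stop words "in" and "the" match anything in those slots.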

>  * handling of meta-tags is a bit lacking - it would be good to be able to 
> surface metadata in the search results, and to search on the contents of 
> meta-data fields.  (this is on my to-do list for the short term in the new 
> year, although it might be a bit specific to our system.  Probably it 
> would be better left from the main htdig until the new parsing code is 
> active).

Well, this is certainly on our wish list as well, and has been requested
a number of times.  To be generally useful, though, it would need to be
customizable.  I.e., you'd need to be able to tell htdig which meta tags
to store, and for each one you'd have to specify: into which database
field the content of the tag should go, and into which template variable,
if any; which flag value should be used for words in the content of the
tag as they go into the word database, and, if it's a new/non-standard flag
value, what weight will be assigned to it at search time; and finally, how
a user can request that this field alone be searched from the search form.
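Just to make the shape of that customization concrete, here's a sketch of what a per-tag configuration might look like.  None of these attribute names exist in htdig; they only illustrate the pieces listed above.

```python
# Hypothetical per-meta-tag configuration -- every key name here
# is invented to illustrate the requirements, not real htdig config.
META_TAG_CONFIG = {
    "author": {
        "db_field": "meta_author",   # database field for the tag content
        "template_var": "AUTHOR",    # template variable, if any
        "word_flag": 0x40,           # flag for these words in the word db
        "search_weight": 5,          # weight for that flag at search time
        "search_param": "author",    # form parameter for field-only search
    },
}
```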

I'm not sure which new parsing code you are referring to above.  If you
mean the new query parser, then yes, that would be needed for the last
requirement I mentioned, i.e. to specify how to pick out a particular
meta tag value from the search form.  The rest of the requirements
would call for some simple, incremental changes to the HTML parser.
We don't have any immediate plans to rewrite that parser yet, so don't
let that hold you back if you want to customise it.  There have been a
few requests that we implement HTML parsing through a generic XML parser,
but that's still a long way away.  (Whatever we end up implementing,
it will be absolutely mandatory that it correctly handles legacy HTML
documents, not just properly conforming XHTML.)

>  * The searching (in 3.2x at least) seems to unduly favour long documents 
> over short ones, and common words over less common words. Worse, in 
> multi-word queries, the ranking of results is skewed towards the most 
> common term (I've implemented a fix for this in parser.cc, but the code is 
> too horrible to disclose - it counts the number of distinct query terms, 
> and gives those documents containing more than one of the terms a score 
> boost (and more of a boost for more terms). It's not perfect, but it's a 
> good start.  I'll see if I can set up a demo sometime so people can play 
> with the two settings).

You might want to have a look at the multimatch_factor attribute handling
code I've added to the 3.1.6 snapshots.  It's a bit of an ugly hack too,
but does seem to be somewhat useful.  If you can help Geoff and Quim get
better rankings in the new query parser, that would be wonderful.
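The boost scheme you describe can be sketched like this (a reconstruction of the idea, not your parser.cc code or the multimatch_factor implementation): sum per-document scores, count distinct query terms matched, and scale the score up for each additional distinct term.

```python
from collections import defaultdict

def rank(matches, boost=2.0):
    """matches: list of (doc_id, term, score) postings.
    Sum per-document scores, then boost documents that matched
    more distinct query terms (bigger boost for more terms)."""
    scores = defaultdict(float)
    terms = defaultdict(set)
    for doc, term, score in matches:
        scores[doc] += score
        terms[doc].add(term)
    return {doc: s * (boost ** (len(terms[doc]) - 1))
            for doc, s in scores.items()}
```

With boost=2.0, a document matching both terms of a two-word query with total score 2.0 ends up at 4.0, while a document scoring 2.0 on a single common term stays at 2.0.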

>  * The HTTP 1.1 persistent connection code is a little conservative - if 
> it can't find a robots.txt file initially the web server tends to close 
> the connection.  This causes htdig to decide that the server won't do 
> persistent connections.  I've poked the code so that it always tries for a 
> persistent connection.

That sounds like a bug to me.  Gabriele, are you following this thread?
If so, could you look into this problem, please?

Geoff asked, does the server seem to close the connection after any sort
of 404?  I'd add to this, does closing the connection always disable
subsequent persistent connections?  Only after an error?
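One less aggressive policy than always retrying would be to track failures per server and only give up on keep-alive after several closed connections in a row.  A sketch (hypothetical class, not htdig's connection code):

```python
class ServerState:
    """Per-server keep-alive tracking: don't permanently disable
    persistent connections after a single early close (e.g. after
    a 404 on robots.txt); only back off after repeated failures."""
    def __init__(self, max_failures=3):
        self.failures = 0
        self.max_failures = max_failures

    def connection_closed_early(self):
        self.failures += 1

    def connection_reused(self):
        # Server demonstrably supports persistence; reset the count.
        self.failures = 0

    def try_persistent(self):
        return self.failures < self.max_failures
```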

>  * I've made a change to the noindex settings - I've got noindex2 and 
> noindex3 (and also linksonly, linksonly2 and linksonly3) which are handy 
> for getting links out of menu bars without indexing the text.  I had a 
> thought about making some general string list version, but then I decided 
> that it would be too much trouble keeping track of which of the strings 
> was active at any one time, and I decided that 3 options would be enough 
> for anyone (we've only ever used 2 at once). Sometime I'd like to rewrite 
> the HTML parser to be a single-pass beast, but more on that later.

It's not clear to me why you'd need to use 2 different tags for turning
off indexing, let alone 3.  As I said yesterday, the latest snapshots
support <noindex follow>, which would be like your <linksonly> tag, and
they correctly handle other tags so that, for example, a </script> tag
doesn't turn indexing back on if it was turned off by an earlier meta
robots tag.

Can you provide an example of where you'd need 2 different tags?
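The correct-handling point above amounts to tracking *why* indexing is off, not just whether it is off.  A minimal sketch (token names are invented for illustration):

```python
def index_words(tokens):
    """tokens: sequence of ("tag", name) and ("word", text) items.
    Record each reason indexing is off separately, so that e.g.
    </script> cannot re-enable indexing that a meta robots
    noindex tag turned off."""
    off = set()   # reasons indexing is currently disabled
    out = []
    for kind, value in tokens:
        if kind == "tag":
            if value == "script":
                off.add("script")
            elif value == "/script":
                off.discard("script")
            elif value == "meta-noindex":
                off.add("robots")
        elif kind == "word" and not off:
            out.append(value)
    return out
```

A boolean flag would get this wrong: the `/script` below would flip indexing back on even though the meta robots tag is still in effect.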

> Here's a bunch of things I think about every so often, but haven't (so 
> far) done anything about (and with a 7-week-old son in the house I don't 
> think I'll be doing much quickly in my spare time).

Hey, congrats!  Don't worry, they do get less demanding as they get older.

>  * It would be nice to get the latest database code integrated (I've 
> nearly talked the guys at work around to thinking that this is a good idea 
> for us to do, so I'll see if we can get this done sometime in the new year 
> - currently we're in revenue-gathering mode, but I'll see when we can get 
> around to it). It would be nice not to have the db get corrupt after 
> 30,000 documents or so.

This is certainly already high up on our list of priorities for 3.2.0b4.
I believe the latest mifluz library includes the latest Berkeley DB code
plus its word list compression code.

>  * I'm not so sure that the new parsing scheme of calculating the score at 
> search time is such a good idea - while it means that there is a great 
> deal of flexibility with the scoring algorithms (and let me fix the 
> multi-word searching without reindexing) it is much slower than the 3.1 
> series.  I'm not sure that word-level precision is really necessary either 
> - phrase searching is probably sufficiently uncommon that it could be 
> achieved by scanning possible result candidates.

Are you saying that 3.2 is slower at search time, or overall including
indexing time?  True, indexing is much slower in 3.2, but in some cases
searching in 3.2 is faster than 3.1, because the changes in database
layouts often more than make up for the time it takes to search a bigger
word database.  3.1's htsearch was awfully slow when it had a lot of
matches and had to look up db.docs.index and db.docdb records for all
of them.

>  * My latest off-the-wall idea is to move to a generic XML (buzzword! but 
> possibly appropriate) intermediate format for indexing, and use an 
> external parser to translate from HTML->digML.  This would mean that only 
> a single internal parser would be required, and it would be simple, as all 
> the hard stuff would be done in something like Perl, where it's really 
> easy to muck about with text - actually we can do something like this now 
> with the external parser stuff, but I don't know if it works well with 
> 3.2x.  A general intermediate format would make it easier to index non-web 
> page things, and DB-driven web sites, and stuff like that.  The Greenstone 
> Digital Library project (www.nzdl.org) does something similar, and 
> initially I thought it was a mad idea, but it grew on me over time.

I'd be worried about the performance hit htdig would take as a result of
this.  I think for most users, HTML documents make up the vast majority
of what they index, so we want this parser to be fast.  Requiring an
external converter for HTML would really slow down indexing.

Having a single internal parser is certainly a good idea, but I think that
parser should be able to handle all current flavours of HTML natively.

>  * It would be nice to be able to have htsearch run as a server as well as 
> a CGI.  We shouldn't remove the CGI option, as I expect that there are a 

As Geoff said, there are lots of other ways to improve the speed.  However,
if you want to implement this we'd certainly consider your mods for the
next release.

>  * better internationalisation support would be wonderful, but a real 
> challenge to implement.  I'd really like support for greater than 8-bit 
> character sets, and simultaneous multiple languages would be good too 
> (IBM's free Unicode code can do both, but would be a bear to integrate). I 
> think I'll be waiting for a while on this one.

This is certainly on our wish list as well.  Hopefully the glibc changes
Geoff talked about will help.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html