[JB] >Expiry dates [to improve caching of static pages] I'm not familiar with how they fit into the HTTP/HTML specification. If you (or anyone) can provide the appropriate HTML meta tags with a short explanation, I will add them to message pages. Index pages may not be so easy as they are often rebuilt (limitation of MHonArc)
[AL] Sorry, cannot help. The HTTP/HTML specs should be easy to get. Anyone on a list with people familiar with the details of how expiry dates interact with caches (ISP and end-user) should be able to advise on correct usage, if nobody on the gossip list responds as a result of copying your question here. It would need careful testing of actual results; caches behave weirdly.

Although I don't know precisely HOW to do it, I'm pretty confident that making static message pages indefinitely cacheable would significantly improve international performance. But only AFTER they become really "static", i.e. when the thread links for both date and subject have been updated. Otherwise the first cached copy, without the "next" links, might be used "forever" (until the user discovers shift-refresh) if somebody sharing an ISP cache happens to access the first version of a message before the "next" links are created.

I'm also pretty sure the OPPOSITE (a very short cache expiry) would be desirable for index pages, or people would keep seeing old indexes.

It may be more appropriate to set this on Apache folders as part of the HTTP protocol rather than within the HTML. (I noticed that MHonArc has the ability to set the modification timestamp of a file based on the message date, which could be used to interact with the webserver specifying an HTTP expiry time relative to the modtime.) I don't know what happens when you just leave it at the default.

[JB] >[htdig will] get [all] the HTML files (or just the new ones??)

I think it looks at the timestamps of every HTML file. Not positive.

[AL] Hmm, you mentioned earlier that: "The search engine runs as a batch process once a week. Last Wednesday it took about 10 hours to process 110,000 messages." Does that mean you had 110,000 total online, or received 110,000 new that week? Incremental indexes would be nice if feasible (I still haven't read the Ht://Dig docs). Daily (or hourly) batches would then make searches much less out of date.
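For what it's worth, the "set it on Apache folders, relative to modtime" idea could look something like the mod_expires sketch below. This is only a sketch under my assumptions: it assumes Apache has mod_expires loaded, and the directory paths are placeholders, not the real archive layout.

```apache
# Sketch only: assumes Apache's mod_expires is available.
# Paths below are placeholders for the real archive layout.
ExpiresActive On

# Message pages: long expiry measured from the file's modification
# time (which MHonArc can set from the message date).
<Directory "/var/www/archive/messages">
    ExpiresDefault "modification plus 1 year"
</Directory>

# Index pages: expire almost immediately so readers see new threads.
<Directory "/var/www/archive/indexes">
    ExpiresDefault "access plus 5 minutes"
</Directory>
```

The caveat above still applies: a message page should only fall under the long-expiry rule once its "next" thread links exist, or the linkless first version may be cached for a very long time.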
Looks like only a factor of 17 increase in scale before you would be running indexing 17 x 10 = 170 hours in each 24 x 7 = 168-hour week, so some redesign may be needed if you are expanding rapidly. BTW, I got the impression from some discussion following http://www.mail-archive.com/[email protected]/msg00090.html that wilma with glimpse allows for monthly indexes and an index of indexes (and that the version required for use with MHonArc is free, contrary to the above message). Is there a particular reason for choosing Ht://Dig?

[JB] >rcfile...digger.model

There is some interplay, but only in the CGI form for searching.

[AL] If you are ever re-organizing it and get a chance to put the common stuff in separate message.model, dateindx.model, searchindx.model, searchbutton.model and common.model files, used to install both rcfile and the related parts of digger (removed from digger.model), that could make it easier to customize: only one file would ever need to be changed. It might also make it easier for a customizer to visualize the effect of changing the message, dateindx and searchindx display formats, by separating what the customizer enters from the MHonArc rcfile weirdnesses and keeping the parameters you really need to understand the meaning of in common.model.

At the same time it might be useful to switch from "install time" configuration to "run time" (i.e. mailmed.init), with the ability to use different configs for different lists; that could make it easier to try out alternatives suited to particular lists.

Isolating the search button stuff into one place might make it easier to take advantage of one apparent advantage of Ht://Dig over glimpse: the possibility of running the indexing, and the responses to search requests, on a different server at a low-bandwidth location, separate from the one that actually serves up the static pages from a high-bandwidth location.
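To spell out the back-of-envelope arithmetic above: the 10 hours per weekly run for 110,000 messages is your figure; the assumption that indexing time scales linearly with archive size is mine.

```python
# Back-of-envelope check of the "factor of 17" claim.
# Figure from the thread: one weekly batch run takes 10 hours.
# Assumption (mine): indexing time grows linearly with archive size.

HOURS_PER_RUN = 10.0
HOURS_PER_WEEK = 24 * 7  # 168

# Largest growth factor before a weekly run no longer fits in a week:
max_factor = HOURS_PER_WEEK / HOURS_PER_RUN
print(max_factor)              # 16.8 -- roughly the "factor of 17"

# At 17x the current scale the weekly run would need:
print(17 * HOURS_PER_RUN)      # 170.0 hours, more than the 168-hour week
```

So anything much beyond a 17-fold growth in archive size makes a full weekly re-index arithmetically impossible, which is the point where incremental indexing stops being a nicety and becomes a requirement.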
Combined with proper caching, this might mean lots of remote sites could just install the (incremental) indexing, giving users the ability to do a very quick search, and reasonably quick access to the actual documents if they happen to already be in the cache (e.g. from a recent index run). (I'm not familiar enough with glimpse and Ht://Dig yet to know whether that possibility really is a difference between them.)
