RE: htdig, general customization

Albert . Langer Sun, 15 Nov 1998 17:36:13 -0500

[JB]
Total archive size is ~110,000 messages. Growth is currently under
1000 per day. There may be room for improving search indexing
efficiency (incrementality?), which could deliver tangible benefits, as
you mentioned.


[AL]
What is the corresponding average number of Web requests (for messages
and for indexes) per day?

Presumably dividing that number by each of the total archive size and
new messages per day would provide two major indicators that could be
correlated with expected future web traffic, but would be misleading
without a pattern for Web requests as a function of the length of time
the message has been stored. Do you have any way to extract that "decay"
pattern from Apache logs correlated with timestamps on corresponding
message files?
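One way the "decay" pattern could be extracted is sketched below. It is only a sketch: it assumes Apache Common Log Format lines and a mapping from request path to the time the message file was stored (both the mapping and the function name are hypothetical, not part of the existing service).

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches the timestamp and request path of an Apache Common Log Format line.
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET (?P<path>\S+)')

def request_age_histogram(log_lines, stored_at, bucket_days=7):
    """Count requests by message age (in buckets of bucket_days) at request time.

    stored_at maps a request path to a timezone-aware datetime for when the
    message file was written (e.g. built from file mtimes); this mapping is
    an assumption of the sketch.
    """
    hist = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m or m.group("path") not in stored_at:
            continue  # not a request for a known message file
        requested = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        age = requested - stored_at[m.group("path")]
        if age.days >= 0:
            hist[age.days // bucket_days] += 1
    return dict(sorted(hist.items()))
```

With the default bucket of 7 days, key 0 counts requests made in a message's first week, key 1 those in its second week, and so on.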

Also what is the 24 hour and 7 day pattern for web requests and for new
message arrivals to estimate their peak periods for capacity
projections?
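A rough sketch of how the daily and weekly patterns might be summarized, assuming the request or arrival times have already been parsed into datetime objects (function names are hypothetical):

```python
from collections import Counter
from datetime import datetime

def peak_share(timestamps, peak_hours=5):
    """Return (busiest hours of the day, fraction of events inside them)."""
    if not timestamps:
        return [], 0.0
    by_hour = Counter(ts.hour for ts in timestamps)
    busiest = [hour for hour, _ in by_hour.most_common(peak_hours)]
    share = sum(by_hour[h] for h in busiest) / len(timestamps)
    return sorted(busiest), share

def weekday_counts(timestamps):
    """Count events per weekday, Monday=0 through Sunday=6."""
    counts = Counter(ts.weekday() for ts in timestamps)
    return [counts.get(day, 0) for day in range(7)]
```

Running peak_share separately on Web requests and on new message arrivals would show whether their peak periods coincide.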

If for example one used the crude 80/20 rule to suppose that there are
currently 800 new messages arriving over 5 peak period hours (approx 20%
of the day), that is 160 per hour. Your FAQ estimates a capacity for
5,000 messages per hour with current configuration so (assuming
unchanged distribution of list sizes etc) that would allow scaling by
about a factor of 30 - say from 150 lists to roughly 4,500 (still a very
small part of the total of half a million or so estimated to already
exist prior to the expected explosion in list use when the enormous
numbers of people still learning to "surf the web" discover that actual
participation through lists is more interesting).
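A quick way to check this kind of headroom estimate, using only the figures quoted in this thread (the function name is hypothetical, and the 80/20 split is the crude assumption stated above):

```python
def scaling_headroom(new_per_day, peak_share, peak_hours, capacity_per_hour):
    """Return (peak messages per hour, capacity divided by that peak rate)."""
    peak_rate = new_per_day * peak_share / peak_hours
    return peak_rate, capacity_per_hour / peak_rate

# 1000 new messages per day, 80% of them in 5 peak hours, 5000/hour capacity.
peak_rate, factor = scaling_headroom(1000, 0.80, 5, 5000)
print(peak_rate, factor)      # 160.0 31.25
print(int(150 * factor))      # lists supportable, starting from 150 lists
```

Changing peak_share or peak_hours shows how sensitive the estimate is to the assumed shape of the daily peak.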

Earlier calculations show room for only a factor of 17 in total size
before continuous weekly indexing becomes necessary.
This suggests incremental indexing is more urgent than improvements in
new message per hour capacity. Even more so since total size will grow
just from age of archives as well as from growth in new messages.
 
Would be more difficult but interesting to also do stats on correlations
between average Web requests per message per week, new messages per
week, number of subscribers, age of list, age of list archive and number
of posters who generate various proportions of the posted messages -
also standard deviations etc (and breakdowns by categories of list when
these are established).
(Some of these figures may be available from surveys of major lists,
correlation with archive web requests is the thing your analysis could
contribute. There are probably lots of social science types doing
"internet" related projects who could be encouraged to look into this.)

[JB]
I chose htDig because it appeared to beat out the competition by being
cleanly implemented, documented, robust, credible, customizable, well
supported, actively maintained, very fast at searching, linux
compatible, open source, widely used, and installable via RPM.  Its
biggest drawbacks are that indexing (as currently configured) takes
time, and that it uses a good chunk of disk space for its indexes.
(I'd rather buy more disk space than a bigger CPU, though)

[AL]
Sounds like good reasons! Long term speed of searching will be more
important than speed of indexing and disk space. More below.

(Catching up with items not replied to in private email from earlier
message:
http://www.mail-archive.com/[email protected]/msg00041.html )

[JB]
Ideally, messages should be indexed for searching as soon as they are
received. However, if this is not practical (due to computational
expense) your solution is a good one. 

There still may be some confusion, as the search engine will
only match the first message. Still, this is better than the current
situation.

[AL] 
Thanks for having already fixed error message problem so promptly!

Incremental indexing (either for each batch of messages or in frequent
larger batches such as daily or hourly) would not be just "ideal" but an
eventually essential significant enhancement over weekly index runs
which must be inherently out of date and very wasteful in re-processing
entire total number of messages instead of just new messages.

I still haven't checked out Ht://Dig and glimpse or other alternatives
but would guess that avoiding excessive computational expense would
require search engine specifically designed for incremental indexing
rather than configuration/tuning of search engine designed for (easier,
but eventually more expensive) complete re-index every run. However
might well be possible to fit a plug in replacement of the actual
indexing machinery within the general framework provided by the rest of
Ht://Dig or using the Wilma framework provided by MHonArc. Worth
investigating possibilities available now even if turns out not to be
worth doing now as would certainly be needed eventually if only because
of computational inefficiency of continuously re-indexing overwhelming
majority of messages that are old when the service has been running for
a fair while. Deciding future approach now could affect design of other
changes.
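To make the difference concrete, here is a toy sketch of incremental indexing. It is not how Ht://Dig or glimpse work internally; the class and its in-memory structures are hypothetical and only illustrate that each run touches new messages and skips everything already indexed.

```python
import re
from collections import defaultdict

class IncrementalIndex:
    """Toy inverted index that only processes messages it has not seen."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of message ids
        self.indexed = set()               # ids already processed

    def add_batch(self, messages):
        """Index new messages from {msg_id: text}; return how many were new."""
        added = 0
        for msg_id, text in messages.items():
            if msg_id in self.indexed:
                continue                   # old message: no re-processing
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                self.postings[word].add(msg_id)
            self.indexed.add(msg_id)
            added += 1
        return added

    def search(self, word):
        """Return the sorted ids of messages containing the word."""
        return sorted(self.postings.get(word.lower(), ()))
```

The cost of each add_batch call is proportional to the new messages only, whereas a complete re-index costs in proportion to the whole archive on every run.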

[AL]
>2. Readers who receive such error messages have no way to determine who
>is "the webmaster of thissite" - the reader who first encountered the
>error naturally reported it to the mailing list instead. I am guessing
>that [EMAIL PROTECTED] is the appropriate address as a result of finding a
>"feedback" link (not "webmaster" link) at
>http://www.mail-archive.com/about.html
>This can easily be fixed by including a mailto link for the appropriate
>address in the error message text.

[JB]
Yes, that error message sucks. However, I can't change it without
patching htdig, which I am reluctant to do from a maintainability
standpoint. Better to make sure the error never appears.

[AL]
Error messages potentially displayed to end users exist because it is
impractical to make sure the errors never happen (as opposed to
eliminating particular occasions as you have just done).

Such error messages should always provide an appropriate mailto link
which will always need to be customized per site. This is an Ht://Dig
problem rather than your problem, however a solution which avoids any
maintainability issues for you is to provide a generic patch for
Ht://Dig that adds a configuration variable for "USERBUGREPORTS=email
address displayed in error messages visible to end users" and patches
that in to templates for their error messages (or more likely to what
looks like single common text applicable to all such messages).

Then the patch would be maintained automatically as part of future
Ht://Dig releases without any extra maintenance effort.
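The idea can be illustrated outside Ht://Dig itself. The sketch below is plain Python, not Ht://Dig code: USERBUGREPORTS is the variable name proposed above rather than an existing attribute, and the footer text and example address are made up for illustration.

```python
from string import Template

# Hypothetical site configuration; USERBUGREPORTS is the proposed variable,
# not an existing Ht://Dig setting, and the address is a placeholder.
config = {"USERBUGREPORTS": "feedback@example.org"}

# Hypothetical common footer shared by every user-visible error message.
ERROR_FOOTER = Template(
    'Please report this problem to '
    '<a href="mailto:$USERBUGREPORTS">$USERBUGREPORTS</a>.'
)

def render_error(message, config):
    """Append the site-specific contact line to an error message."""
    return message + "\n" + ERROR_FOOTER.substitute(config)
```

Because the address lives in configuration rather than in the message text, each site customizes one value and never patches the templates.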

[AL]
>3. The error message was generated a few days ago but did not result in
>a routine fix as a result of sysadmin perusal of error logs, since I
>have confirmed that it is still there now on any attempt to use the
>"Search" button. This may be unavoidable in a free service with "no
>warranty" etc etc but I hope it is useful to list it as a separate
>perceived problem for future attention - a proactive response to
>internally generated error messages will result in an even more useful
>service than one which responds to problems only after email from users
>who have encountered those problems.

[JB]
The service was designed to be as automatic and simple as possible for
administration. It really should run by itself. One reason is I am
doing almost all development, financing, administration, technical
support, etc., myself, during free time.

I agree it is important to look at logs promptly and fix any flaws
that surface.  Excellence in concept, design, implementation and
administration are all important for success. This is a real service,
that real people are using, and it appears to be growing quickly and I
want to keep it that way.  However, unless someone volunteers or gives
me a suitcase full of money, I'm going to have to continue focusing
on improving automation as opposed to improving manual administration.

[AL]
The excellence in concept, design and implementation of your service is
the incredible simplicity with which people who want to archive a list
can do so just by subscribing your archive address to the list without
installing and configuring any software or even having a place to
install it, and without having administrative control of a list or a web
site. This concept is really brilliant and any flaws can be forgiven ;-)

I agree it makes much more sense to concentrate on automation rather
than manual administration.
Glad to hear you may be implementing externally accessible detailed
logging. Suggest you should use standardized syslog type log entries
with priorities, categories and uniform fields for affected mailing
lists and message numbers so that they can be easily filtered using
standard syslog based log filters. Then you can easily have alarms
triggered by important errors for proactive response and statistics on
frequency of others for improving automation.
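A minimal sketch of what such uniform entries and an alarm filter might look like. The severity names and numbers are the standard syslog ones; the field layout (category, list=, msg=) is only an assumption about how entries could be laid out, not an existing format.

```python
import re

# Numeric severities as in syslog: a lower number means more severe.
PRIORITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}

def format_entry(priority, category, list_name, msg_no, text):
    """Build one log line with uniform, machine-filterable fields."""
    return f"{priority} {category} list={list_name} msg={msg_no} {text}"

ENTRY_RE = re.compile(
    r"(?P<priority>\w+) (?P<category>\w+) list=(?P<list>\S+) "
    r"msg=(?P<msg>\S+) (?P<text>.*)")

def alarms(lines, threshold="err"):
    """Yield parsed entries at or above the given severity."""
    limit = PRIORITY[threshold]
    for line in lines:
        m = ENTRY_RE.match(line)
        if m and PRIORITY.get(m.group("priority"), 7) <= limit:
            yield m.groupdict()
```

Entries below the threshold are not discarded by this scheme; they can be counted per category and per list to give the frequency statistics.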

[...AL]
>If there is a problem, a solution might be to routinely manually respond
>to confirmation messages (perhaps semi-automatically when the types from
>the most popular list servers are recognized) instead of enabling others
>to do that on your behalf.

[JB]
Hmm... I'm wary of manual solutions and added complexity.  However, an
auto-confirm would be handy as many people forget to do the
confirmation stage, especially if there is a delay involved.

[AL]
Agreed. I was confusing this with the separate issue of desirability of
permitting (optional) administrative controls over whether a list
actually gets added or blocked (which I gather is not a priority for
you).

Is auto-confirm covered in your to do list by the item:
"- deal with new/draft RFC's regarding list-ID; confirmation headers"

I'm not familiar with the RFCs on this but assume implementing them will
be useful for future but there will remain large numbers of lists with
no standardization of confirmation procedures.

Suggest possible approach as follows as optional alternative to just
subscribing the list as explained in the FAQ. Pinch code used to handle
subscriptions and confirmations for a particular list server such as
SmartList to implement. Aim is for semi-automation to become full
automation  able to handle thousands of new list additions without
service administrator intervention as easily as list servers can handle
tens of thousands of new user subscriptions (for multiple lists in both
cases). Essential point is that the problems are dual - single list
server normally handles subscriptions from large numbers of remote users
to large numbers of local lists. Single archive server subscribes to
large numbers of remote lists on behalf of one remote user for each list
using essentially the same methods as list server confirmations but also
relaying messages between the listservers and the users (like a poor
chess player taking on two grand masters and guaranteeing a win against
at least one or a draw with both, instead of losing to both as expected,
by simply relaying each grand master's moves to the other).

1. Add List form requests entry of list posting address, list admin
address and email address of person adding list (defaulted to any info
extracted from browser). Also allow email to special address with this
info instead of web form - form just generates equivalent email so same
form processor at email address handles both using pinched code for
actual listserver (customizable so not necessarily done at same web site
- i.e. post results in email to form processor email address usually at
same location but different user as the archives subscription address,
but not necessarily same domain as web form or archives themselves, not
CGI-bin). Third input method could be to just "forward" (not bounce) a
message from the list to a specified email address (setup to expect
forwarding and re-process headers accordingly to recognize the name of
the list and perhaps type of list server software).

2. Input processing at the special email address results in confirmation
type message to person adding the list, including their key and
explanation of how to add the lists by replying and deleting
inapplicable options from the text of the reply, similar to and
implemented using existing listserver software such as SmartList.
Include advice that line saying "Stop sending me these messages!" MUST
be deleted from any reply or that will happen, and line saying "List is
still not visible in archive (attempt N) key:keystring" must NOT be
deleted from any reply in order for attempts to continue.

3. Text of reply includes list of standard options for common list
servers based on the 2 list addresses supplied and guesses about
majordomo, listserv, SmartList etc etc in order of most likely
subscription method, based on analysis of supplied addresses and lookup
of databases, including implementation of above RFCs.

4. Reply from person adding the list is munged and turned into the first
"subscribe style" remaining among those not deleted by the person adding
the list and transmitted directly from archive service subscription
address to guessed or specified list admin address with no information
referring to that person.

5. Careful checks built in to prevent user sending "subscribe me"
messages to the list itself - possibly including looking up the
addresses provided in various "list of lists" databases available
(results from which could also be provided on web site and/or by email).

6. First reply from list is assumed to be welcome or subscription
confirmation request and relayed to user with the user's key again added
together with reminder to reply even if the text says it is a welcome
not a confirmation and no reply appears necessary.

7. Reply from user is munged and relayed to list from the archiver
subscription address, hopefully including whatever key the list asked
for in order to confirm the subscription plus implementation of RFCs
etc.

8. Repeat for a small fixed number of attempts or until stop condition
results from deletion or non-deletion of lines specified in 2 above.

9. Monitoring and gradual refinement of the munging should result in
improvement from semi-automation to full automation over time as the
vagaries of list confirmation methods and user incomprehension are
encountered and dealt with.

10. Above approach may facilitate adding administrative controls over
which users are allowed to add categories of lists (e.g. by list of
authorized email addresses and/or keys issued to authorized users to add
to their messages or use when filling in the form).

11. Also can provide for users to add further information when filling
in the form or replying such as:
a) Identification of welcome message which can be grabbed and made
available from a link in the index pages to archives for that list so
that service administrator can use to automate or do manual unsubscribes
and people browsing archives can lookup (outdated) information about how
to use the list.
b) Identification of keys included within the welcome message so that
these can be removed from the publicly displayed copy in case they are
also used to manipulate other list functions.
(This could be done by guessing that random garbage within subject line
and body are probably keys and listing them in appendix to the welcome
message as relayed with request to delete those that should not be
removed from the text).
c) URL for FAQ or website associated with the list to be used as link in
the index pages to archives for that list.

12. This is all obviously much more complicated for both implementation
and use than the present approach, so it should only be an alternative
provided for use when that has not worked (or when control over
subscriptions is required).
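The reply handling in steps 2, 4 and 8 could be sketched roughly as follows. The two marker lines are the ones quoted in step 2; everything else (the function name, the assumption that option lines start with "subscribe", the limit of 3 attempts) is hypothetical and would need adjusting to real list server syntax.

```python
import re

STOP_LINE = "Stop sending me these messages!"
RETRY_RE = re.compile(
    r"List is still not visible in archive \(attempt (\d+)\) key:(\S+)")

def next_action(reply_text, expected_key, max_attempts=3):
    """Decide what to do with a reply from the person adding a list.

    Returns ("stop", None) or ("send", command), where command is the first
    subscribe-style line the user left in the reply.
    """
    # Drop quoting marks and surrounding whitespace from each reply line.
    lines = [ln.lstrip("> ").strip() for ln in reply_text.splitlines()]
    if STOP_LINE in lines:
        return ("stop", None)          # user left the stop line in place
    retry = next((m for m in map(RETRY_RE.fullmatch, lines) if m), None)
    if retry is None or retry.group(2) != expected_key:
        return ("stop", None)          # retry line deleted, or wrong key
    if int(retry.group(1)) > max_attempts:
        return ("stop", None)          # small fixed number of attempts
    for ln in lines:
        if ln.lower().startswith("subscribe"):
            return ("send", ln)        # first option not deleted by the user
    return ("stop", None)
```

The command returned would then be sent from the archive subscription address to the list admin address, with nothing in it referring to the person who added the list.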
