[JB] Total archive size is ~110,000 messages. Growth is currently under 1,000 per day. There may be room for improving search indexing efficiency (incrementality?), which could deliver tangible benefits, as you mentioned.
[AL] What is the corresponding average number of Web requests (for messages and for indexes) per day? Presumably dividing that number by the total archive size, and separately by new messages per day, would provide two major indicators that could be correlated with expected future web traffic, but they would be misleading without a pattern for Web requests as a function of the length of time the message has been stored. Do you have any way to extract that "decay" pattern from Apache logs correlated with timestamps on the corresponding message files? Also, what are the 24 hour and 7 day patterns for web requests and for new message arrivals, to estimate their peak periods for capacity projections? If, for example, one used the crude 80/20 rule to suppose that there are currently 800 new messages arriving over 5 peak period hours (approx 20% of the day), that is 160 per hour. Your FAQ estimates a capacity of 5,000 messages per hour with the current configuration, so (assuming unchanged distribution of list sizes etc) that would allow scaling by a factor of about 30 (5,000 / 160) - say from 150 lists to about 4,500 (still a very small part of the total of half a million or so estimated to already exist prior to the expected explosion in list use when the enormous numbers of people still learning to "surf the web" discover that actual participation through lists is more interesting). Earlier calculations show a factor of 17 in total size before continuous weekly indexing becomes necessary. This suggests incremental indexing is more urgent than improvements in new-messages-per-hour capacity. Even more so since total size will grow just from the age of the archives as well as from growth in new messages.
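As a rough check on the peak-hour estimate above, the arithmetic can be sketched in a few lines of Python. The inputs (800 messages over 5 peak hours, 5,000 messages per hour capacity, 150 lists) are the figures quoted in the message; the function and variable names are purely illustrative and not part of the service.

```python
# Rough capacity headroom from the figures quoted in the message above.
# Function and variable names are illustrative, not part of the service.

def peak_headroom(peak_messages, peak_hours, capacity_per_hour):
    """Return (peak arrival rate per hour, capacity divided by that rate)."""
    peak_rate = peak_messages / peak_hours
    return peak_rate, capacity_per_hour / peak_rate

peak_rate, factor = peak_headroom(peak_messages=800, peak_hours=5,
                                  capacity_per_hour=5000)
lists_now = 150  # current number of archived lists, as quoted

print(f"peak rate: {peak_rate:.0f} messages/hour")
print(f"headroom factor: {factor:.2f}")
print(f"lists supportable at the same mix: about {lists_now * factor:.0f}")
```

Dividing 5,000 by 160 gives a headroom factor of about 31, assuming the mix of list sizes stays the same. Swapping in measured peak figures from the Apache and mail logs would replace the crude 80/20 guess.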
Would be more difficult but interesting to also do stats on correlations between average Web requests per message per week, new messages per week, number of subscribers, age of list, age of list archive and number of posters who generate various proportions of the posted messages - also standard deviations etc (and breakdowns by categories of list when these are established). (Some of these figures may be available from surveys of major lists; correlation with archive web requests is the thing your analysis could contribute. There are probably lots of social science types doing "internet" related projects who could be encouraged to look into this.)
[JB] I chose htDig because it appeared to beat out the competition by being cleanly implemented, documented, robust, credible, customizable, well supported, actively maintained, very fast at searching, Linux compatible, open source, widely used, and installable via RPM. Its biggest drawbacks are that indexing (as currently configured) takes time, and that it uses a good chunk of disk space for its indexes. (I'd rather buy more disk space than a bigger CPU, though.)
[AL] Sounds like good reasons! Long term, speed of searching will be more important than speed of indexing and disk space. More below.
(Catching up with items not replied to in private email from earlier message: http://www.mail-archive.com/[email protected]/msg00041.html )
[JB] Ideally, messages should be indexed for searching as soon as they are received. However, if this is not practical (due to computational expense) your solution is a good one. There still may be some confusion, as the search engine will only match the first message. Still, this is better than the current situation.
[AL] Thanks for having already fixed the error message problem so promptly!
Incremental indexing (either for each batch of messages or in frequent larger batches such as daily or hourly) would be not just "ideal" but eventually an essential enhancement over weekly index runs, which are inherently out of date and very wasteful in re-processing the entire archive instead of just the new messages. I still haven't checked out Ht://Dig and glimpse or other alternatives, but would guess that avoiding excessive computational expense would require a search engine specifically designed for incremental indexing rather than configuration/tuning of a search engine designed for (easier, but eventually more expensive) complete re-indexing every run. However, it might well be possible to fit a plug-in replacement for the actual indexing machinery within the general framework provided by the rest of Ht://Dig, or using the Wilma framework provided by MHonArc. Worth investigating the possibilities available now, even if it turns out not to be worth doing now, as it would certainly be needed eventually, if only because of the computational inefficiency of continuously re-indexing the overwhelming majority of messages, which are old once the service has been running for a fair while. Deciding the future approach now could affect the design of other changes.
[AL]
>2. Readers who receive such error messages have no way to determine who
>is "the webmaster of this site" - the reader who first encountered the
>error naturally reported it to the mailing list instead. I am guessing
>that [EMAIL PROTECTED] is the appropriate address as a result of finding a
>"feedback" link (not "webmaster" link) at
>http://www.mail-archive.com/about.html
>This can easily be fixed by including a mailto link for the appropriate
>address in the error message text.
[JB] Yes, that error message sucks. However, I can't change it without patching htdig, which I am reluctant to do from a maintainability standpoint. Better to make sure the error never appears.
[AL] Error messages potentially displayed to end users exist because it is impractical to make sure the errors never happen (as opposed to eliminating particular occasions as you have just done). Such error messages should always provide an appropriate mailto link which will always need to be customized per site. This is an Ht://Dig problem rather than your problem, however a solution which avoids any maintainability issues for you is to provide a generic patch for Ht://Dig that adds a configuration variable for "USERBUGREPORTS=email address displayed in error messages visible to end users" and patches that in to templates for their error messages (or more likely to what looks like single common text applicable to all such messages). Then the patch would be maintained automatically as part of future Ht://Dig releases without any extra maintenance effort.
[AL]
>3. The error message was generated a few days ago but did not result in
>a routine fix as a result of sysadmin perusal of error logs, since I
>have confirmed that it is still there now on any attempt to use the
>"Search" button. This may be unavoidable in a free service with "no
>warranty" etc etc but I hope it is useful to list it as a separate
>perceived problem for future attention - a proactive response to
>internally generated error messages will result in an even more useful
>service than one which responds to problems only after email from users
>who have encountered those problems.
[JB] The service was designed to be as automatic and simple as possible for administration. It really should run by itself. One reason is I am doing almost all development, financing, administration, technical support, etc., myself, during free time. I agree it is important to look at logs promptly and fix any flaws that surface. Excellence in concept, design, implementation and administration are all important for success.
This is a real service that real people are using, it appears to be growing quickly, and I want to keep it that way. However, unless someone volunteers or gives me a suitcase full of money, I'm going to have to continue focusing on improving automation as opposed to improving manual administration.
[AL] The excellence in concept, design and implementation of your service is the incredible simplicity with which people who want to archive a list can do so just by subscribing your archive address to the list, without installing and configuring any software or even having a place to install it, and without having administrative control of a list or a web site. This concept is really brilliant and any flaws can be forgiven ;-) I agree it makes much more sense to concentrate on automation rather than manual administration. Glad to hear you may be implementing externally accessible detailed logging. Suggest you use standardized syslog type log entries with priorities, categories and uniform fields for affected mailing lists and message numbers, so that they can be easily filtered using standard syslog based log filters. Then you can easily have alarms triggered by important errors for proactive response, and statistics on the frequency of others for improving automation.
[...AL]
>If there is a problem, a solution might be to routinely manually respond
>to confirmation messages (perhaps semi-automatically when the types from
>the most popular list servers are recognized) instead of enabling others
>to do that on your behalf.
[JB] Hmm... I'm wary of manual solutions and added complexity. However, an auto-confirm would be handy as many people forget to do the confirmation stage, especially if there is a delay involved.
[AL] Agreed. I was confusing this with the separate issue of the desirability of permitting (optional) administrative controls over whether a list actually gets added or blocked (which I gather is not a priority for you).
Is auto-confirm covered in your to do list by the item: "- deal with new/draft RFC's regarding list-ID; confirmation headers"? I'm not familiar with the RFCs on this, but assume implementing them will be useful for the future; there will still remain large numbers of lists with no standardization of confirmation procedures. Suggest a possible approach as follows, as an optional alternative to just subscribing the list as explained in the FAQ. Pinch the code used to handle subscriptions and confirmations in a particular list server such as SmartList to implement it. Aim is for semi-automation to become full automation, able to handle thousands of new list additions without service administrator intervention as easily as list servers can handle tens of thousands of new user subscriptions (for multiple lists in both cases). Essential point is that the problems are dual: a single list server normally handles subscriptions from large numbers of remote users to large numbers of local lists. A single archive server subscribes to large numbers of remote lists on behalf of one remote user for each list, using essentially the same methods as list server confirmations but also relaying messages between the list servers and the users (like a poor chess player taking on two grand masters and guaranteeing a win against at least one, or a draw with both, instead of losing to both as expected, by simply relaying each grand master's moves to the other).
1. Add List form requests entry of list posting address, list admin address and email address of person adding list (defaulted to any info extracted from browser). Also allow email to a special address with this info instead of the web form - the form just generates an equivalent email, so the same form processor at the email address handles both, using pinched code from an actual listserver (customizable so not necessarily done at same web site - i.e. post results in email to the form processor email address, usually at same location but a different user from the archive's subscription address, and not necessarily the same domain as the web form or the archives themselves; not CGI-bin). Third input method could be to just "forward" (not bounce) a message from the list to a specified email address (set up to expect forwarding and re-process headers accordingly to recognize the name of the list and perhaps the type of list server software).
2. Input processing at the special email address results in a confirmation type message to the person adding the list, including their key and an explanation of how to add the list by replying and deleting inapplicable options from the text of the reply, similar to and implemented using existing listserver software such as SmartList. Include advice that the line saying "Stop sending me these messages!" MUST be deleted from any reply or that will happen, and that the line saying "List is still not visible in archive (attempt N) key:keystring" must NOT be deleted from any reply in order for attempts to continue.
3. Text of reply includes a list of standard options for common list servers based on the 2 list addresses supplied and guesses about majordomo, listserv, SmartList etc etc in order of most likely subscription method, based on analysis of the supplied addresses and lookup of databases, including implementation of the above RFCs.
4. Reply from person adding the list is munged and turned into the first "subscribe style" remaining among those not deleted by the person adding the list, and transmitted directly from the archive service subscription address to the guessed or specified list admin address with no information referring to that person.
5. Careful checks built in to prevent user sending "subscribe me" messages to the list itself - possibly including looking up the addresses provided in the various "list of lists" databases available (results from which could also be provided on the web site and/or by email).
6. First reply from list is assumed to be a welcome or subscription confirmation request and is relayed to the user with the user's key again added, together with a reminder to reply even if the text says it is a welcome, not a confirmation, and no reply appears necessary.
7. Reply from user is munged and relayed to the list from the archiver subscription address, hopefully including whatever key the list asked for in order to confirm the subscription, plus implementation of RFCs etc.
8. Repeat for a small fixed number of attempts or until a stop condition results from deletion or non-deletion of the lines specified in 2 above.
9. Monitoring and gradual refinement of the munging should result in improvement from semi-automation to full automation over time as the vagaries of list confirmation methods and user incomprehension are encountered and dealt with.
10. Above approach may facilitate adding administrative controls over which users are allowed to add categories of lists (e.g. by a list of authorized email addresses and/or keys issued to authorized users to add to their messages or use when filling in the form).
11. Also can provide for users to add further information when filling in the form or replying, such as:
    a) Identification of the welcome message, which can be grabbed and made available from a link in the index pages to archives for that list, so that the service administrator can use it to automate or do manual unsubscribes and people browsing archives can look up (outdated) information about how to use the list.
    b) Identification of keys included within the welcome message, so that these can be removed from the publicly displayed copy in case they are also used to manipulate other list functions. (This could be done by guessing that random garbage within the subject line and body are probably keys, and listing them in an appendix to the welcome message as relayed, with a request to delete those that should not be removed from the text.)
    c) URL for FAQ or website associated with the list, to be used as a link in the index pages to archives for that list.
12. This is all obviously much more complicated for both implementation and use than the present approach, so it should only be an alternative provided for use when that has not worked (or when control over subscriptions is required).
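The reply handling in steps 2, 4 and 8 above might be sketched roughly as follows. This is only an illustration under stated assumptions: the two marker lines are taken from step 2, but `MAX_ATTEMPTS`, the function names, the returned action labels and the example subscribe commands are all hypothetical, and a real version would more likely reuse existing list server code such as SmartList's, as suggested.

```python
import re

# Marker lines as worded in step 2 of the proposal.
STOP_LINE = "Stop sending me these messages!"
RETRY_RE = re.compile(
    r"List is still not visible in archive \(attempt (\d+)\) key:(\S+)")
MAX_ATTEMPTS = 3  # assumption: the "small fixed number" from step 8


def unquote(line):
    """Drop reply-quoting markers ('>') and surrounding whitespace."""
    return line.lstrip("> \t").strip()


def process_reply(body, expected_key, offered_options):
    """Decide what to do with a reply from the person adding a list.

    Returns a (hypothetical) action label plus the subscribe command
    to relay, if any.
    """
    lines = [unquote(raw) for raw in body.splitlines()]

    # Step 2: leaving the stop line in the reply means "stop".
    if STOP_LINE in lines:
        return ("stop", None)

    # Step 2: deleting the retry line also ends the attempts.
    retry = next((m for m in map(RETRY_RE.fullmatch, lines) if m), None)
    if retry is None:
        return ("stop", None)

    attempt, key = int(retry.group(1)), retry.group(2)
    if key != expected_key:
        return ("reject", None)
    if attempt > MAX_ATTEMPTS:  # step 8: give up after a fixed number
        return ("stop", None)

    # Step 4: use the first offered option the user did not delete.
    for line in lines:
        if line in offered_options:
            return ("subscribe", line)
    return ("no_option", None)


options = ["subscribe example-list", "SUBSCRIBE EXAMPLE-LIST Archive"]
reply = ("> SUBSCRIBE EXAMPLE-LIST Archive\n"
         "> List is still not visible in archive (attempt 1) key:abc123\n")
action = process_reply(reply, "abc123", options)
```

The munging of list-server confirmations (steps 6 and 7) and the safety checks in step 5 are deliberately left out; they depend on the quirks of each list server and would be refined over time as step 9 describes.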
