Your message was fun, Jeff.
I reasoned this out similarly, thinking along the lines of the base64
encoding used in MIME. The permissible character set for DOS-platform file
names contains at least 46 characters. Eight base-46 characters can express
46^8 (about 2 x 10^13) different names, roughly 44 bits, which is enough to
keep the collision probability minuscule for archives of any reasonable
size. A 100,000-message archive seems two orders of magnitude too large for
MHonArc's basic design; anything that large using a filesystem as its
database needs to be organized hierarchically. That would add the
subdirectory names to the available namespace.
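
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. It assumes names are drawn uniformly at random from the name space
and uses the usual birthday-bound approximation; the 46-character alphabet
and the function names are my own illustrative choices, not anything
MHonArc does.

```python
import math

ALPHABET_SIZE = 46   # assumed lower bound on legal DOS filename characters
NAME_LENGTH = 8      # characters in the 8.3 base name


def name_bits(alphabet_size: int, length: int) -> float:
    """Bits of name space offered by `length` characters from the alphabet."""
    return length * math.log2(alphabet_size)


def collision_probability(alphabet_size: int, length: int, messages: int) -> float:
    """Birthday-bound estimate that at least two of `messages` uniformly
    random names collide: 1 - exp(-n(n-1) / (2N))."""
    space = alphabet_size ** length
    return -math.expm1(-messages * (messages - 1) / (2 * space))


if __name__ == "__main__":
    print(f"{name_bits(ALPHABET_SIZE, NAME_LENGTH):.1f} bits")
    for n in (1_000, 100_000):
        p = collision_probability(ALPHABET_SIZE, NAME_LENGTH, n)
        print(f"n={n}: p ~ {p:.1e}")
```

Under those assumptions the estimate is on the order of 10^-8 for a
1,000-message archive and 10^-4 for 100,000 messages.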
-- SP
> Anyway, sorry I didn't jump in then, but the kind-of-fun question was
> implicitly raised: how many bits of randomness do you need for
> reproducible URLs in MHonArc? (Hey, it's not every day that real life
> questions can be tackled like problem sets!)
> Now if we are restricted to ending the filenames with something like
> .htm, then there are only about 41 bits of randomness, and then we
> run about 1% risk of collision for a puny n=100,000 message archive.
> That's pushing it.
>
> OK, one last note. If we use a real filesystem, with upper and lower
> case letters in the filenames, we'd still need 10 characters in the
> filename to meet/exceed the acceptable safety margin (57 bits). So
> those lower case letters don't help us much in the region we are
> interested in.
>
> Using MD-5 checksums for filenames is complete overkill, statistically
> speaking. They are 128 bits, and would consume 20-odd characters in
> the filename. 10-character filenames would do the trick nicely. There
> is certainly no need to combine MD-5 and message-IDs from a
> statistical standpoint.