At 11:02 PM -0500 2/21/2000, Stan Ryckman wrote:
>
>I have a feeling there are some far corners of the world that we
>don't hear much from, that are still running really old stuff, that
>this might affect as well.
Yeah. But from the responses I've gotten on the list, nobody can
think of one recently. And my lists DO reach into those weird areas,
and I can't think of a case, either. So I'm not going to worry about
it -- the advantages to being case insensitive are legion, and the
problem is that it's not strictly RFC conformant, but is conformant
with how pretty much everyone else acts, so I'm not worried. Safe to
say there's a de facto standard here, and at some point, the RFC
probably needs to be updated...
>Isn't it really the same problem as [EMAIL PROTECTED], [EMAIL PROTECTED],
>and [EMAIL PROTECTED]? Sometimes they're the same; other times
>they might be different (or different mailboxes, like shell vs. ppp).
Actually, they're different, because there are some well-accepted (if
not formally defined) ways to deal with subdomaining. Doing the same
for the account part is a lot less rigorous, because every IS
department sets its standards differently, and then changes them
every two years ("when I signed up, I was [EMAIL PROTECTED]. Now
I'm [EMAIL PROTECTED]"...)
I finished the routines that define my lookup process about 2AM this
morning, so I now have a system that actually seems to work pretty
well. Here's a quick overview for the curious.
I store in the SQL database the following info: the email address,
the account part, the top 2 parts of the domain, the 3ld and the 4ld,
if they're defined. If not, those parts are blank (not NULL). My
basic assumptions are:
1) if they can get close, help them get the rest of the way.
2) don't disclose addresses unless you're damn sure it's them. No
"close, choose which one you are" stuff.
3) I'm not overly worried about unsubscribe slams. First, they're
almost unheard of. Second, just about every MLM in the universe will
allow for an "unsubscribe foolist [EMAIL PROTECTED]".
4) The realities of the SQL database need to be kept in mind: I need
to narrow searches through indexing, because grepping five million
lines simply doesn't cut it.
So an address like "[EMAIL PROTECTED]" would end up stored as:
chuq, fred, bar, plaidworks.com, plus the full address. The full
address is UNIQUE, so I can guarantee no duplicates.
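A minimal sketch of that splitting step in Python, for illustration only. The function name split_address, the lowercasing, and the choice to drop any labels beyond the fourth are assumptions, and the example.com address is a hypothetical stand-in:

```python
def split_address(address):
    """Split an address into (acct, ld4, ld3, ld2), all lowercased.

    ld2 is the top two labels of the domain joined by a dot. ld3 and ld4
    are the next labels to the left, or "" (not None) when absent.
    Labels beyond the fourth are ignored in this sketch (an assumption).
    """
    acct, _, domain = address.strip().lower().rpartition("@")
    labels = domain.split(".")
    ld2 = ".".join(labels[-2:])
    ld3 = labels[-3] if len(labels) >= 3 else ""
    ld4 = labels[-4] if len(labels) >= 4 else ""
    return acct, ld4, ld3, ld2


# Hypothetical address with the same shape as the stored example above.
parts = split_address("Chuq@Fred.Bar.Example.COM")
```

Here parts comes back in the same order as the stored row described above: account, 4ld, 3ld, then the 2ld.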
So when a user comes to find themselves, so to speak, I do the following:
1) try the easy one first -- look up the exact email. I'm going to be
interested how my stats come out as to how often this hits.....
2) Now we split up the address being searched into its component
pieces. And I then look up as an exact match on the acct, first
trying the 4ld+3ld+2ld, then 3ld+2ld, then 2ld, so we go least
general to most general.
3) Now we retry the searches, but meta-card the username (search on
"%chuq%" instead of "chuq").
4) Finally, I simply look up the domain name with NO account at all,
same 4ld/3ld/2ld order as before.
In all cases, I return a match ONLY if I match exactly one address.
If I get zero or more than one, it's a failure. I stop on the first
successful match -- and since I'm getting more and more general in my
searches, I'm slowly widening the search until I run out of searches.
Pretty much everything is indexed, and by using a LIMIT to the query,
I can even cut things like "[EMAIL PROTECTED]" short and keep it fast.
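The four search stages above could be sketched roughly as follows, using Python's sqlite3 module as a stand-in for the real database. The table and column names, the helper names, and LIMIT 2 (just enough to tell "exactly one" from "more than one") are all assumptions, not the actual routines:

```python
import sqlite3


def split_address(address):
    """Return (acct, ld4, ld3, ld2), lowercased; missing parts are ""."""
    acct, _, domain = address.strip().lower().rpartition("@")
    labels = domain.split(".")
    ld3 = labels[-3] if len(labels) >= 3 else ""
    ld4 = labels[-4] if len(labels) >= 4 else ""
    return acct, ld4, ld3, ".".join(labels[-2:])


def make_db(addresses):
    """Build an in-memory table holding the full address plus its parts."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE subscribers ("
        " email TEXT UNIQUE, acct TEXT, ld4 TEXT, ld3 TEXT, ld2 TEXT)"
    )
    db.execute("CREATE INDEX idx_ld2 ON subscribers (ld2)")
    for addr in addresses:
        db.execute(
            "INSERT INTO subscribers VALUES (?, ?, ?, ?, ?)",
            (addr.strip().lower(), *split_address(addr)),
        )
    return db


def only_match(db, where, params):
    """Return the email if exactly one row matches, otherwise None."""
    rows = db.execute(
        f"SELECT email FROM subscribers WHERE {where} LIMIT 2", params
    ).fetchall()
    return rows[0][0] if len(rows) == 1 else None


def find_subscriber(db, query):
    acct, ld4, ld3, ld2 = split_address(query)

    # Stage 1: exact match on the full address.
    hit = only_match(db, "email = ?", (query.strip().lower(),))
    if hit:
        return hit

    # Domain scopes, from least general to most general.
    scopes = []
    if ld4:
        scopes.append(("ld2 = ? AND ld3 = ? AND ld4 = ?", (ld2, ld3, ld4)))
    if ld3:
        scopes.append(("ld2 = ? AND ld3 = ?", (ld2, ld3)))
    scopes.append(("ld2 = ?", (ld2,)))

    # Stage 2: exact account; stage 3: wildcarded account; stage 4: none.
    # Note: "%" or "_" typed in the account act as LIKE wildcards here.
    acct_filters = [("", ())]
    if acct:
        acct_filters = [
            (" AND acct = ?", (acct,)),
            (" AND acct LIKE ?", (f"%{acct}%",)),
            ("", ()),
        ]

    for acct_sql, acct_params in acct_filters:
        for scope_sql, scope_params in scopes:
            hit = only_match(db, scope_sql + acct_sql, scope_params + acct_params)
            if hit:
                return hit  # stop on the first unambiguous match
    return None
```

With hypothetical subscribers such as "alice@example.org" alone in its domain, a lookup for any account at example.org falls through to the last stage and returns her address, while a domain with two subscribers returns nothing at that stage.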
The 2nd set of searches takes care of the "[EMAIL PROTECTED]" vs
"[EMAIL PROTECTED]" issue, as long as both aren't subscribed. If they
are, "[EMAIL PROTECTED]" won't match, and they'd have to add the 3ld to it
to find one, then they can deal with the other. But for someone who's
not sure WHICH domain they were using, "[EMAIL PROTECTED]" will find
all those nasty subdomains...
The 3rd set works to deal with the issues of those changing IS naming
standards, and, in fact, allows a user to try to look for their name
through a series of guesses if they want. How far it'll be documented
is still TBD, but it'll be there (and the admins will use it!).
The fourth one may not be intuitively obvious -- but there's a
growing number of domains where ALL mail is forwarded to a single
person. And for non-VERPed mail, it's pretty literally impossible to
find the address, unless someone's stuffed something in a Received
line along the way. I have that setup on chuqui.com, for instance, so
if you send email to [EMAIL PROTECTED], I'll get it. If it's
Bcced, it'll be difficult for me to find out what address it came
from. But on my system, I could type in [EMAIL PROTECTED], and if
there's only ONE address subscribed, I'll count it a match and return
it. This solves the problem for all of those one-owner domains, and
for places like "foo.demon.co.uk" subdomains. That's why I store
both the 3ld and 4ld: overseas, there is a huge, and
growing, number of domains that aren't unique until that fourth part
-- 1ld is the country code, 2ld is their ".com", 3ld is the ISP, and
4ld is the actual domain. And if you build a large set of addresses
and don't search off that fourth part, it gets really nasty really
fast -- and slow.
An early version of this actually reversed the domain name and stored
it that way, and indexed the reversed domain, but if you think about
it, that's not very efficient, since half your database will live off
of the "moc." part of the index. Better to leave them rightside up
and randomize the start of the indexes more, but in that case, it then
makes sense to only index the 2ld, and then it makes sense to keep
the 3ld and 4ld as separately selectable fields, if only so you can
get them out of the way when you don't want them....
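A small illustration of that clustering, as a sketch only. The helper name leading_key_counts and the four-character prefix width are assumptions, and example.com is a hypothetical entry alongside domains mentioned in this message:

```python
from collections import Counter


def leading_key_counts(keys, width=4):
    """Count how many keys share each leading `width` characters."""
    return Counter(key[:width] for key in keys)


domains = ["plaidworks.com", "chuqui.com", "example.com", "demon.co.uk"]

# Character-reversed keys, as in the early version described above.
reversed_keys = [d[::-1] for d in domains]

forward = leading_key_counts(domains)         # keys start "plai", "chuq", ...
backward = leading_key_counts(reversed_keys)  # most keys start "moc."
```

In this toy set, three of the four reversed keys share the "moc." prefix, while the rightside-up keys each start differently.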
So far, the early tests have been quite encouraging. We'll see how it
works once real users get their hands on it.
>How about that to do so makes it difficult to credibly criticize others
>who violate other (probably more important) RFCs?
Nah -- anyone who has a solid rationale to avoid an aspect of an RFC,
and the research to back up that it doesn't cause any significant
harm, is welcome to ignore it. But if they do it just because they
feel like it, that's another matter. After all, slavish following of
standards leads to stagnation. Slavish disregard of them leads to
chaos. It's that spot in the middle that leads to both useful systems
AND innovation.....
>If there really aren't any anymore, it ought to be pretty trivial to
>change the RFC, right? Go for it. :-)
Yeah, as soon as I have some spare time. Although, since sendmail has
been doing this for years and effectively we're all following their
lead, why not call Eric and have him champion it? I'm not innovating
here -- I'm simply validating that a variation between real-world and
standard falls to the side of the real world.
--
Chuq Von Rospach - Plaidworks Consulting (mailto:[EMAIL PROTECTED])
Apple Mail List Gnome (mailto:[EMAIL PROTECTED])
"And they sit at the bar and put bread in my jar
and say 'Man, what are you doing here?'"