Re: case sensitivity?

Chuq Von Rospach Mon, 21 Feb 2000 22:57:22 -0800
At 11:02 PM -0500 2/21/2000, Stan Ryckman wrote:
>
>I have a feeling there are some far corners of the world that we
>don't hear much from, that are still running really old stuff, that
>this might affect as well.

Yeah. But from the responses I've gotten on the list, nobody can 
think of one recently. And my lists DO reach into those weird areas, 
and I can't think of a case, either. So I'm not going to worry about 
it -- the advantages to being case insensitive are legion, and the 
problem is that it's not strictly RFC conformant, but is conformant 
with how pretty much everyone else acts, so I'm not worried. Safe to 
say there's a de facto standard here, and at some point, the RFC 
probably needs to be updated...

>Isn't it really the same problem as [EMAIL PROTECTED], [EMAIL PROTECTED],
>and [EMAIL PROTECTED]?  Sometimes they're the same; other times
>they might be different (or different mailboxes, like shell vs. ppp).

Actually, they're different, because there are some well-accepted (if 
not formally defined) ways to deal with subdomaining. Doing the same 
for the account part is a lot less rigorous, because every IS 
department sets its standards differently, and then changes them 
every two years ("when I signed up, I was [EMAIL PROTECTED] Now 
I'm [EMAIL PROTECTED]"...)

I finished the routines that define my lookup process about 2AM this 
morning, so I now have a system that actually seems to work pretty 
well. Here's a quick overview for the curious.

I store in the SQL database the following info: the email address, 
the account part, the top 2 parts of the domain, the 3ld and the 4ld, 
if they're defined. If not, those parts are blank (not NULL). My 
basic assumptions are:

1) if they can get close, help them get the rest of the way.

2) don't disclose addresses unless you're damn sure it's them. No 
"close, choose which one you are" stuff.

3) I'm not overly worried about unsubscribe slams. First, they're 
almost unheard of. Second, just about every MLM in the universe will 
allow for an "unsubscribe foolist [EMAIL PROTECTED]".

4) The realities of the SQL database need to be kept in mind: I need 
to narrow searches through indexing, because grepping five million 
lines simply doesn't cut it.

So an address like "[EMAIL PROTECTED]" would end up stored as:

chuq, fred, bar, plaidworks.com plus the full address. The full 
address is UNIQUE, so I can guarantee no duplicates.
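
As a rough illustration of that split (not the actual routine described here), the sketch below breaks an address into the account, the 4ld, the 3ld, and the top two parts of the domain, leaving missing parts blank. The name split_address, the lowercasing, and the example.com addresses are assumptions made for the example; labels beyond the fourth are simply ignored.

```python
def split_address(address):
    """Return (account, 4ld, 3ld, 2ld); undefined parts are blank, not None."""
    # Lowercasing is an assumption, in line with the case-insensitive
    # matching discussed in this thread.
    account, _, domain = address.strip().lower().rpartition("@")
    labels = domain.split(".")
    second = ".".join(labels[-2:])                    # top two parts
    third = labels[-3] if len(labels) >= 3 else ""    # 3ld, if defined
    fourth = labels[-4] if len(labels) >= 4 else ""   # 4ld, if defined
    return account, fourth, third, second
```

With this sketch, a hypothetical "Pat@a.b.example.com" splits into pat, a, b, example.com, and "pat@example.com" leaves both the 3ld and 4ld blank.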

So when a user comes to find themselves, so to speak, I do the following:

1) try the easy one first -- look up the exact email. I'm going to be 
interested to see how my stats come out as to how often this hits.....

2) Now we split up the address being searched into its component 
pieces. I then look for an exact match on the acct, first 
trying the 4ld+3ld+2ld, then 3ld+2ld, then 2ld, so we go least 
general to most general.

3) Now we retry the searches, but meta-card the username (search on 
"%chuq%" instead of "chuq").

4) Finally, I simply look up the domain name with NO account at all, 
same 4ld/3ld/2ld order as before.

In all cases, I return a match ONLY if I match exactly one address. 
If I get zero or more than one, it's a failure. I stop on the first 
successful match -- and since I'm getting more and more general in my 
searches, I'm slowly widening the search until I run out of searches. 
Pretty much everything is indexed, and by using a LIMIT to the query, 
I can even cut things like "[EMAIL PROTECTED]" short and keep it fast.
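
The widening search above can be sketched roughly as follows, using Python's sqlite3 module against an in-memory table. This is an illustration under stated assumptions, not the actual implementation: the table and column names (subscribers, acct, ld4, ld3, ld2), the helper names, and the choice of LIMIT 2 as the cutoff are all hypothetical.

```python
import sqlite3

def split_address(address):
    """Return (account, 4ld, 3ld, 2ld); undefined parts are blank."""
    account, _, domain = address.strip().lower().rpartition("@")
    labels = domain.split(".")
    fourth = labels[-4] if len(labels) >= 4 else ""
    third = labels[-3] if len(labels) >= 3 else ""
    return account, fourth, third, ".".join(labels[-2:])

def make_db():
    """Create a hypothetical subscriber table in memory."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE subscribers ("
        " email TEXT NOT NULL UNIQUE,"      # full address, no duplicates
        " acct TEXT NOT NULL,"
        " ld4 TEXT NOT NULL DEFAULT '',"    # blank, not NULL
        " ld3 TEXT NOT NULL DEFAULT '',"
        " ld2 TEXT NOT NULL)"
    )
    db.execute("CREATE INDEX idx_ld2 ON subscribers (ld2)")
    return db

def add_subscriber(db, address):
    email = address.strip().lower()
    db.execute(
        "INSERT INTO subscribers (email, acct, ld4, ld3, ld2)"
        " VALUES (?, ?, ?, ?, ?)",
        (email, *split_address(email)),
    )

def unique_match(db, where, params):
    """Return the stored address only if exactly one row matches."""
    # LIMIT 2 (an assumption) is enough to tell "one" from "more than one".
    rows = db.execute(
        "SELECT email FROM subscribers WHERE " + where + " LIMIT 2", params
    ).fetchall()
    return rows[0][0] if len(rows) == 1 else None

def find_subscriber(db, address):
    """Try each search in turn; stop at the first unambiguous hit."""
    address = address.strip().lower()
    hit = unique_match(db, "email = ?", [address])   # 1) exact address
    if hit:
        return hit
    acct, ld4, ld3, ld2 = split_address(address)
    domain_steps = [                                 # least to most general
        ("ld4 = ? AND ld3 = ? AND ld2 = ?", [ld4, ld3, ld2]),
        ("ld3 = ? AND ld2 = ?", [ld3, ld2]),
        ("ld2 = ?", [ld2]),
    ]
    account_steps = [
        ("acct = ? AND ", [acct]),                   # 2) exact account
        ("acct LIKE ? AND ", ["%" + acct + "%"]),    # 3) wildcarded account
        ("", []),                                    # 4) domain only
    ]
    for acct_sql, acct_params in account_steps:
        for dom_sql, dom_params in domain_steps:
            hit = unique_match(db, acct_sql + dom_sql, acct_params + dom_params)
            if hit:
                return hit
    return None                                      # zero or ambiguous
```

For example, with only "pat@mail.example.com" subscribed under example.com, find_subscriber(db, "pat@example.com") returns that address; with two "sam" addresses subscribed under the same 2ld, a search on a third, unsubscribed "sam" address there returns None rather than offering a choice.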

The 2nd set of searches takes care of the "[EMAIL PROTECTED]" vs 
"[EMAIL PROTECTED]" issue, as long as both aren't subscribed. If they 
are, "[EMAIL PROTECTED]" won't match, and they'd have to add the 3ld to it 
to find one, then they can deal with the other. But for someone who's 
not sure WHICH domain they were using, "[EMAIL PROTECTED]" will find 
all those nasty subdomains...

The 3rd set works to deal with the issues of those changing IS naming 
standards, and, in fact, allows a user to try to look for their name 
through a series of guesses if they want. How far it'll be documented 
is still TBD, but it'll be there (and the admins will use it!).

The fourth one may not be intuitively obvious -- but there's a 
growing number of domains where ALL mail is forwarded to a single 
person. And for non-VERPed mail, it's pretty literally impossible to 
find the address, unless someone's stuffed something in a Received 
line along the way. I have that setup on chuqui.com, for instance, so 
if you send email to [EMAIL PROTECTED], I'll get it. If it's 
Bcced, it'll be difficult for me to find out what address it came 
from. But on my system, I could type in [EMAIL PROTECTED], and if 
there's only ONE address subscribed, I'll count it a match and return 
it. This solves the problem for all of those one-owner domains, and 
for places like "foo.demon.co.uk" subdomains. That's why I store 
both the 3ld and 4ld: overseas, there is a huge, and 
growing, number of domains that aren't unique until that fourth part 
-- 1ld is the country code, 2ld is their ".com", 3ld is the ISP, and 
4ld is the actual domain. And if you build a large set of addresses 
and don't search off that fourth part, it gets really nasty really 
fast -- and slow.

An early version of this actually reversed the domain name and stored 
it that way, and indexed the reversed domain, but if you think about 
it, that's not very efficient, since half your database will live off 
of the "moc." part of the index. Better to leave them rightside up 
and randomize the start of the indexes more, but in that case, it then 
makes sense to only index the 2ld, and then it makes sense to keep 
the 3ld and 4ld as separately selectable fields, if only so you can 
get them out of the way when you don't want them....
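
To make the "moc." point concrete, here is a small sketch that counts how many index keys share the same leading characters when the domain is stored reversed versus right side up. The sample domains and helper names are hypothetical, and this only illustrates key clustering; it says nothing about how any particular database engine's indexes actually perform.

```python
from collections import Counter

# Hypothetical sample domains, for illustration only.
DOMAINS = [
    "alpha.com", "bravo.com", "charlie.com",
    "delta.org", "mail.echo.com", "foxtrot.net",
]

def second_level(domain):
    """Top two parts of the domain, e.g. 'echo.com'."""
    return ".".join(domain.split(".")[-2:])

def leading_chars(keys, n=4):
    """Count how many keys share each n-character prefix."""
    return Counter(key[:n] for key in keys)

# Character-reversed keys: every .com domain starts with "moc."
reversed_keys = [d[::-1] for d in DOMAINS]

# Right-side-up keys on the 2ld only: the prefixes spread out
upright_keys = [second_level(d) for d in DOMAINS]

reversed_spread = leading_chars(reversed_keys)
upright_spread = leading_chars(upright_keys)
```

In this sample, four of the six reversed keys begin with "moc.", while no two of the right-side-up 2ld keys share a four-character prefix.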

So far, the early tests have been quite encouraging. We'll see how it 
works once real users get their hands on it.

>How about that to do so makes it difficult to credibly criticize others
>who violate other (probably more important) RFCs?

Nah -- anyone who has a solid rationale to avoid an aspect of an RFC, 
and the research to back up that it doesn't cause any significant 
harm is welcome to ignore it. But if they do it just because they 
feel like it, that's another matter. After all, slavish following of 
standards leads to stagnation. Slavish disregard of them leads to 
chaos. It's that spot in the middle that leads to both useful systems 
AND innovation.....


>If there really aren't any anymore, it ought to be pretty trivial to
>change the RFC, right?  Go for it.  :-)

Yeah. As soon as I have some spare time. Although, since sendmail has 
been doing this for years and effectively we're all following their 
lead, why not call Eric and have him champion it? I'm not innovating 
here -- I'm simply validating that a variation between real-world and 
standard falls to the side of the real-world.



--
Chuq Von Rospach - Plaidworks Consulting (mailto:[EMAIL PROTECTED])
Apple Mail List Gnome (mailto:[EMAIL PROTECTED])

And they sit at the bar and put bread in my jar
and say 'Man, what are you doing here?'"