RE: [Boston.pm] Email filtering...

Bob Rogers Tue, 11 Feb 2003 12:23:45 -0800

   From: "Wizard" <[EMAIL PROTECTED]>
   Date: Tue, 11 Feb 2003 10:35:11 -0500


   > Here's another one: future proofing. The two or three character TLD
   > constraint you see today isn't necessary, and maybe in the future we'll
   > see longer addresses ([EMAIL PROTECTED]). Or Rendezvous
   > might catch on, and addresses of the form [EMAIL PROTECTED] might become
   > common in some contexts.

   Really, I'm only concerned with the present for the moment. As long as TLDs
   have more than two letters, and CCs all have two letters, 

So far, that's been true (though the country codes are officially TLDs
themselves).

   that would be enough for me, except that how do I determine the
   difference between the CNAME/DOMAIN/TLD?

The short answer is, you can't.  A TLD is always recognizable by its
position at the end of the DNS name, but as for "cname" vs. "domain",
you are trying to make a distinction that doesn't formally exist.  Which
makes this DNS name dissection rather pointless, IMHO.  But, apparently,
you have a legacy API to support . . .
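
   To make that concrete, here is a minimal sketch (not part of the
attached fragment; the sub name split_dns_name and the sample host name
are made up for illustration).  It splits a DNS name on dots: the last
label is the TLD, and everything before it is just a list of labels.
Nothing in the name itself says which label is a "host" and which is a
"domain".

```perl
use strict;
use warnings;

# Hypothetical helper: split a DNS name into its TLD and the remaining labels.
sub split_dns_name {
    my $name = shift;
    $name =~ s/\.$//;                      # drop a trailing root dot, if any
    my @labels = split /\./, $name, -1;    # -1 keeps empty labels visible
    my $tld = pop @labels;                 # the TLD is always the last label
    return ($tld, @labels);
}

my ($tld, @rest) = split_dns_name('www.cs.example.edu');
print "TLD: $tld\n";                       # prints "TLD: edu"
print "other labels: @rest\n";             # prints "other labels: www cs example"
```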

   > If you're hell bent on doing this yourself, I'm sure the relevant RFCs
   > would be a good source of answers. I think rfc822 is the main mail one
   > relevant to your needs, but someone may correct me on that...

   Yeah, it appears that's my only real option. I was hoping that this wouldn't
   have been such a nightmare.

;-}

RFC822 is indeed the introduction to this madhouse, and you should also
check out http://cr.yp.to/immhf.html for a readable "real-world"
introduction to mail headers.  The author, D. J. Bernstein, is also the
author of qmail and ezmlm, and has an extremely informative Web site.
(He doesn't pull his punches, either.)

   Also, I've attached a code fragment I wrote to validate email address
syntax.  All you can really do is verify that the domain part can be
resolved to an address or an MX, and that the localpart is syntactically
correct.  There is no way to check that the localpart names a real
mailbox without actually trying to send mail (though I gather that's not
a requirement for your app).
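
   For the syntax half, a stripped-down sketch of the check might look
like the following.  It applies the same rules as the attached fragment
(atoms only, no quoted strings) but simply returns true or false instead
of generating HTML error messages; the name localpart_syntax_ok is made
up for this example.

```perl
use strict;
use warnings;

# Characters that may not appear in an RFC822 atom: CTLs, DEL, SPACE, the
# "specials", and (as in the attached code) anything outside ASCII.
my $illegal_atom_char = qr/[\000-\037\177-\377 ()<>\@,;:\\"\[\]]/;

# Hypothetical helper: true if $local_part is a dot-separated list of
# nonempty atoms.  Quoted strings are deliberately not supported.
sub localpart_syntax_ok {
    my $local_part = shift;
    return 0 unless defined $local_part && length $local_part;
    return 0 if $local_part =~ $illegal_atom_char;
    # The -1 keeps trailing empty fields, so "rogers." is rejected too.
    my @atoms = split /\./, $local_part, -1;
    return 0 if grep { !length } @atoms;
    return 1;
}

printf "%-12s %s\n", $_, (localpart_syntax_ok($_) ? 'ok' : 'bad')
    for qw(rogers bob.rogers bob..rogers .rogers rogers.);
```

Only the first two samples pass; the other three each contain an empty
token.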

   The full app (see http://bmerc-www.bu.edu/psa/request.htm) contains
other code to limit requests from "commercial" domains, as classified by
the domain of the email address.  But I'm sure no one will be surprised
to learn that this is leaky as hell -- not only are whole countries
exempted by their lack of classifying subdomains, but it's too easy to
get a ".net" or ".org" domain these days, e.g. from dyndns.org.

   Hope this helps you get through the night,

                                        -- Bob Rogers
                                           http://rgrjr.dyndns.org/

------------------------------------------------------------------------
### email address validation
# The main entrypoint is the validate_email_address subroutine.

$rfc822_illegal_atom_character = '[\000-\037\177-\377 ()<>@,;:\\\"\[\]]';
# RFC822 says that the local-part of an address is a nonempty sequence of
# dot-separated words, each of which is either a quoted string or an atom.  An
# atom, in turn, is a nonempty string of "any CHAR except specials, SPACE and
# CTLs" (where DEL is considered a CTL).  The domain part of the address is
# similar, except there is also something called a "domain-literal", which looks
# like a quoted string in square brackets, but I have not seen this syntax used
# for anything but raw IP addresses, which we refuse to support (can't tell if
# they are commercial).  Because it lumps the words together, therefore, this
# regular expression will fail on addresses that use the quoted-string syntax.
# But using the regexp saves us some parsing.  -- rgr, 14-Dec-98.  [Also finally
# restricted "CHAR" to exclude non-ASCII (e.g. ISO) extension characters.  --
# rgr, 1-May-00.]  [We need to do separate tests for empty words (i.e. ".." or
# ".@"), so we could in fact handle atoms that were quoted strings.  But I've
# never seen quoted strings used in actual email addresses; I doubt if most
# sysadmins know it's possible to assign addresses like this.  And I suspect
# most users would balk if given such an address.  -- rgr, 25-Aug-99.]
$illegal_address_boilerplate
    = qq(Unless you specify an
         <a href="inparms.htm#emadr">Internet format e-mail address</a>,
         e.g. <tt>&quot;rogers\@darwin.bu.edu&quot;</tt>, the server will not
         be able to send you e-mail.);
sub address_error {
    # interface to generate_error_message for address error messages.  also
    # returns zero, for the convenience of callers that must return zero for an
    # error.
    my $message = shift;

    if ($illegal_address_boilerplate) {
        $message .= "  $illegal_address_boilerplate";
        # We've used it, so no need to duplicate it.
        $illegal_address_boilerplate = '';
    }
    generate_error_message($message);
    0;
}

sub check_address_component {
    # Check the syntax of the passed address component, using $description and
    # $offset to generate error messages.  In RFC822 terminology, $component is
    # either the local-part or domain of an email address (addr-spec).  Returns
    # zero if the address is syntactically invalid, else the number of atoms
    # (dot-separated pieces) in the component.
    my ($component, $description, $offset) = @_;

    if ($component =~ /$rfc822_illegal_atom_character+/o) {
        # one-based for the user, plus the local-part length for hosts.
        my $bad_start = length($`)+$offset;
        # [we'd like to get $bad_chars in the message.  but the only reliable
        # way to avoid confusing the browser is to render them in hex, which the
        # user is not likely to understand.  -- rgr, 14-Dec-98.]
        my $bad_chars = $&;
        address_error
            ("$description contains "
             . (length($bad_chars) > 1
                ? (length($bad_chars) . " illegal characters with the first "
                   . "one starting")
                : "an illegal character")
             . " at position $bad_start"
             . ($bad_chars =~ /^[\040-\176]*$/
                ? ", namely &quot;" . html_quotify($bad_chars) . "&quot;."
                : '.'));
    }
    else {
        # check the dots.
        my $atom;
        # need the "-1" at the end in order to keep trailing empty fields.
        my @atoms = split(/\./, $component, -1);
        my $n_atoms = scalar @atoms;

        generate_warning_message("Got $n_atoms atoms from "
                                 . "&quot;$component&quot;.")
            if $verbose_p;
        unless ($n_atoms) {
            address_error(qq($description is empty.));
        }
        foreach $atom (@atoms) {
            # we've already checked for illegal characters (including space,
            # tab, and newline), so we only have to complain if there is an
            # empty token.
            unless (length($atom)) {
                address_error
                    (qq($description contains an empty token before
                        position $offset, which is not valid for an Internet
                        e-mail address.));
                $n_atoms = 0;
            }
            $offset += length($atom)+1;
        }
        # if we got no atom errors, return the count of dotted tokens, for
        # further host name checking.
        $n_atoms;
    }
}

sub validate_email_address {
    # Do syntactic checking on the passed email address and, if syntactically
    # valid, attempt to verify that the mail host exists.  Returns 1 if valid,
    # else 0.
    my $email_address = shift;
    my @address_parts;
    my ($local_part, $domain_name, $result, $command) = ('', '', '', '');
    my $offset = 1;     # return 1-based complaints to the user.

    # this whitespace-trimming may be (or become) redundant.
    $email_address =~ s/^\s+//;
    $email_address =~ s/\s+$//;
    if ($email_address =~ /^([^<>]*<)([^<>]+)>[^<>]*$/) {
        # This implements a subset of the RFC822 "route-addr" syntax; the
        # regular expression matches exactly one "<>" pair with optional stuff
        # around it.  But the "route" part (zero or more colon-terminated domain
        # names after the "<" and before the actual mailbox) is completely
        # ignored, and will fail the illegal character testing.  I am assuming
        # that routing is not needed in the '90's.  In any case, we only need to
        # check the mailbox itself for legality.  -- rgr, 14-Dec-98.
        $offset += length($1);
        $email_address = $2;
    }
    # need the "-1" at the end in order to keep trailing empty fields.
    @address_parts = split('@', $email_address, -1);
    ($local_part, $domain_name) = @address_parts;
    if (@address_parts == 2) {
        # check the local part first, since that's how the user sees it.  keep
        # the result, but go on to check the host part too, so that the user
        # gets all of the complaints at once.
        my $n_local_atoms =
            check_address_component($local_part,
                                    ("The user part of the address "
                                     . "(before the at-sign)"),
                                    $offset);
        # Have some kind of host name; validate it here.
        my $n_domain_atoms =
            check_address_component($domain_name,
                                    ("The host part of the address "
                                     . "(after the at-sign)"),
                                    # the +1 is for the at-sign.
                                    $offset+length($local_part)+1);
        if ($n_local_atoms == 0 || $n_domain_atoms == 0) {
            # error message already generated; no point in checking further.
            return(0);
        }
        elsif ($n_domain_atoms == 1) {
            address_error
                ("There are no dot ('.') characters in the host part of the "
                 . "address <tt>&quot;" . html_quotify($email_address)
                 . "&quot;</tt>.");
            return(0);
        }
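        # If we got here, we have a syntactically valid host domain name.  But
        # RFC822 atoms may still contain shell metacharacters (quote, backquote,
        # dollar sign), so insist on plain host name characters before passing
        # the name to a shell.
        unless ($domain_name =~ /^[A-Za-z0-9.-]+$/) {
            return address_error
                ("The host part of the address <tt>&quot;"
                 . html_quotify($email_address) . "&quot;</tt> contains "
                 . "characters that are not valid in a host name.");
        }
        # Now see if we can send mail to it.
        $command = "nslookup -query=MX -timeout=10 '$domain_name'";
        $result = `csh -c "$command |& cat"`;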
        # [the "csh -c 'cmd |& cat'" nonsense mixes the stderr in with the
        # stdout; otherwise it's discarded.  -- rgr, 7-Sep-99.]
        generate_warning_message(#'$PATH: '.html_quotify($ENV{'PATH'})."<br>\n".
                                 qq(Full MX query:<blockquote>
                                    <pre>$command</pre></blockquote>
                                    <tt>nslookup</tt> result:<blockquote>
                                    <pre>$result</pre></blockquote>\n))
            if $verbose_p >= 3;
        # nslookup doesn't return an error code, so we have to examine the
        # printed result.
        if ($result =~ /mail exchanger/) {
            # found at least one MX record for this host, so mail can be
            # delivered; the address is good.
            1;
        }
        elsif ($result =~ /\*\*\*.*/) {
            # error message in MX retrieval; probably no such domain.
            address_error
                ("Our server cannot figure out how to send mail to <tt>&quot;"
                 . html_quotify($domain_name) . "&quot;</tt>.  Here is the "
                 . "error returned by the name server:\n<blockquote>"
                 . html_quotify($&)
                 . "</blockquote>\nPlease recheck the spelling.");
        }
        elsif ($result = `nslookup -timeout=10 '$domain_name'`,
                 # \Q...\E keeps the dots in the name from matching any character.
                 $result !~ /Name:\s*\Q$domain_name\E\nAddress:/i) {
            # no MX records, and no such host.
            address_error
                (qq(Our server cannot find the host named
                    <tt>'$domain_name'</tt>.  Please recheck the spelling.));
        }
        else {
            # the host exists, so we'll assume the address is good.
            1;
        }
    }
    else {
        address_error
            (qq(There must be exactly one <a href="inparms.htm#emadr">
                at-sign (<tt>'\@'</tt>)</a> in the address <tt>&quot;)
             . html_quotify($email_address) . '&quot;</tt>.');
        # It's all user part, but we might as well check it now.
        check_address_component($local_part,
                                "The user part of the address",
                                $offset);
        0;
    }
}
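
   The fragment above calls a few things that live elsewhere in the full
app: generate_error_message, generate_warning_message, html_quotify, and
the $verbose_p flag.  Their real definitions are not shown here, so the
following are only hypothetical stand-ins (guesses at plausible behavior,
not the app's actual code) that should be enough to exercise the fragment
from the command line.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for helpers the fragment expects from the full app.
our $verbose_p = 0;        # raise to 3 to see the nslookup transcript
our @error_messages;       # collected so that a caller can inspect them

sub html_quotify {
    # Escape the characters that are special in HTML text.
    my $text = shift;
    $text =~ s/&/&amp;/g;  # must come first, or we re-escape our own output
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

sub generate_error_message {
    # Record the message and echo it to stderr.
    my $message = shift;
    push @error_messages, $message;
    print STDERR "Error: $message\n";
}

sub generate_warning_message {
    my $message = shift;
    print STDERR "Warning: $message\n";
}

print html_quotify('"Bob" <rogers@darwin.bu.edu>'), "\n";
```

Note that the fragment itself assigns to undeclared globals, so it will
not compile under "use strict" unless those variables are first declared
with "our" (or the fragment is loaded from a file that does not enable
strict).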