[whatwg] Comments on the definition of a valid e-mail address

2009-08-30 Thread Ian Hickson
On Sun, 23 Aug 2009, Aryeh Gregor wrote:

 Section 4.10.4.1.5 defines a valid e-mail address as follows:
 
 A valid e-mail address is a string that matches the production 
 dot-atom-text @ dot-atom-text where dot-atom-text is defined in RFC 
 5322 section 3.2.3. [RFC5322]
 
 This is much more restrictive than the full range of e-mail addresses 
 allowed by RFC 5322 et al.  I've been considering whether to use input 
 type=email in MediaWiki, and whether to change our server-side e-mail 
 address validation to match.  Historically, MediaWiki has mostly just 
 required that an @ symbol be present in the address. Originally we used 
 a simplistic regex, but when users complained, we looked into the RFCs 
 and decided it was too complicated to bother with validation beyond 
 checking for an @ sign.
 
 So before switching us over, I decided to do some research on how many 
 users' addresses would be invalidated.  I used the database for the 
 English Wikipedia.  Over all registered users, I found 3,088,880 
 confirmed addresses, not necessarily all distinct.  (Confirmed here 
 means that in theory, modulo bugs, the user followed a confirmation link 
 in the e-mail they received, so the address probably works in practice.)  
 Of those, 3,255 (~0.1%) failed HTML 5 validation, as determined using 
 the following regex-based database query:
 
 r...@rosemary:enwiki SELECT COUNT(*) FROM user WHERE
 user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
 '^[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*$'
 AND user_email != '';
 +----------+
 | COUNT(*) |
 +----------+
 |     3255 |
 +----------+
 1 row in set (16 min 10.80 sec)

Thanks for this research, this is exactly the kind of hard data that is 
most useful when writing the spec.


 (Someone please tell me if my regex doesn't match HTML 5 here.)

If we let 

   X = [-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+

...then the regexp is:

   ^X(\.X)*...@x(\.X)*$

I believe this is correct, yes.
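
For readers who want to try the pattern, here is a minimal Python sketch of the production "dot-atom-text @ dot-atom-text". It is an illustration only, not code from this thread. The names ATEXT, DOT_ATOM_TEXT and is_valid_email are hypothetical. The character class follows atext from RFC 5322 section 3.2.3, which includes "&"; the regex as archived above does not show that character, so treat the difference as an assumption of the sketch.

```python
import re

# atext per RFC 5322 section 3.2.3 (assumption: includes "&",
# which the archived regex above does not show).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# The HTML 5 draft production discussed here: dot-atom-text @ dot-atom-text
VALID_EMAIL = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")


def is_valid_email(value: str) -> bool:
    """Return True if the whole string matches the production."""
    return VALID_EMAIL.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["user@example.com", "user.@example.com",
                      "a..b@example.com", "user@example.com "]:
        print(repr(candidate), is_valid_email(candidate))
```

Using fullmatch plays the role of the ^ and $ anchors in the SQL REGEXP.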


 Inspection showed that the overwhelming majority of the failures were 
 due to the presence of excess whitespace, often a single trailing space, 
 or a space inserted before or after the @ sign.  When I adjusted the 
 regex to ignore those failures, I got a smaller list, 202 (about 0.007% 
 of the total): [...]
 
 Some of these were clearly wrong, and shouldn't have been confirmed to 
 begin with.  Some even didn't have an @ sign, so probably were submitted 
 in some window when we did no validation at all (and I have no idea how 
 they got confirmed).  Of the ones that possibly work, I identified two 
 major categories:
 
 1) Addresses in the form foo b...@baz.example, or similar.  These 
 mostly match RFC 5322's name-addr production instead of addr-spec (some 
 have trailing semicolons, or are missing the initial , etc.). I assume 
 these were copy-pasted from a mail application.

These are intentionally not allowed, since it is expected that the name 
will be taken from elsewhere, and the e-mail address will then be pasted 
into a template along the lines of $name $email.


 2) Addresses with dots in incorrect places, in either the local part
 or the domain name part.  For instance, multiple consecutive dots, or
 leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
 I asked one of the users with an invalid address of the form
 f...@example.com, and he said it worked fine for him.  GNU mail gave
 a syntax error when I tried to send mail to that address, but Gmail
 sent it without complaint, and the user received it successfully.

I've changed the grammar to allow a trailing dot in the username part.


 I should also note that this was only the English Wikipedia, and it 
 might be that speakers of other languages are more prone to use other 
 types of addresses that don't meet HTML 5's specification.  When looking 
 at the Swedish and German databases, for instance, I found one or two 
 addresses that had apparently been confirmed but contained non-ASCII 
 characters.  I didn't know the users with those addresses, and I didn't 
 want to send them unsolicited mail, so I wasn't able to establish 
 whether those addresses actually worked or the confirmation was bogus.

I'll leave it as requiring ASCII for now; I expect UAs to do IDNA 
processing on the UI end for the domain side. I'm not sure what is 
supposed to happen on the username side.


 Conclusions: At a minimum, I suggest that HTML 5 require that user 
 agents strip all whitespace from e-mails, not just newlines.  Roughly 
 0.1% of the addresses from my sample were valid except for extraneous 
 whitespace.  It's a small additional change that would cut the number of 
 illegitimately invalid addresses in my sample by a factor of more than 
 ten.

This is a UI issue -- if the user enters whitespace, the user agent is 
allowed to trim it. It won't submit with whitespace, so user agents are 
likely to want to do 

Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-27 Thread Nils Dagsson Moskopp
On Monday, 24.08.2009, at 16:33 -0400, Brian Campbell wrote:
 Given that there are so many technically invalid addresses that  
 actually do work to deliver mail, and that I'm sure some people have  
 odd addresses due to poor form validation […]

Well, maybe the RFC should be updated as well?  Or is this just normal
accounting for the robustness principle?

Cheers
-- 
Nils Dagsson Moskopp
http://dieweltistgarnichtso.net



Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Smylers
Aryeh Gregor writes:

 Historically, MediaWiki has mostly just required that an @ symbol be
 present in the address.  Originally we used a simplistic regex,

It's relatively well known that a simple regex can't be used to match
e-mail addresses (and not match things that aren't!); Jeffrey Friedl's
'Mastering Regular Expressions' (O'Reilly) included a pattern for this
over a decade ago, but it is exceedingly long:

  http://groups.google.co.uk/group/comp.lang.perl.misc/msg/603ba6fc642a3124
  http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

 ... but when users complained, we looked into the RFCs and decided it
 was too complicated to bother with validation beyond checking for an @
 sign.

It's too complicated for most developers to roll their own validation,
but there are standard libraries available which get it right.

 ... I decided to do some research on how many users' addresses would
 be invalidated [by HTML 5's validation] ...
 
 1) Addresses in the form foo b...@baz.example, or similar.  These
 mostly match RFC 5322's name-addr production instead of addr-spec

Forms on websites capturing users' e-mail addresses typically want just
the address part, prompting for the human-readable name in a separate
box, so I think HTML 5's input type=email not allowing the above is
helpful.

 2) Addresses with dots in incorrect places, in either the local part
 or the domain name part.  For instance, multiple consecutive dots, or
 leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
 I asked one of the users with an invalid address of the form
 f...@example.com, and he said it worked fine for him.  GNU mail gave
 a syntax error when I tried to send mail to that address, but Gmail
 sent it without complaint, and the user received it successfully.

There may actually be several categories of oddly placed dots.  While
the address in the form you give above works it may be, say, that those
with repeated dots in the hostname part don't work.

On the specific case of a . immediately before the @, I've seen that
before: this Perl library module extends an RFC-compliant module to
allow just that; its author admits .@ breaks the RFCs but claims such
breakage is useful in the real world, specifically when dealing with
e-mail addresses for Japanese mobile phones:

  http://search.cpan.org/perldoc?Email::Valid::Loose

That somebody has found this to be a sufficiently widespread problem
with standard Perl e-mail address validation to write and upload a
module which 'fixes' this (and just that; it makes no other changes)
suggests that people will find HTML 5's input type=email to be
problematic in precisely the same way.

 There were other types of addresses that didn't meet HTML 5's
 specification after whitespace was stripped, but none with more than a
 single-digit number of addresses occurring in the sample of three
 million or so that I looked at.

So it may actually be that there isn't a general problem here of lots of
real-world e-mail addresses which work but don't comply with the RFCs;
it may simply be the one case of .@?

There isn't a plethora of Email::Valid extensions which relax various
different criteria; just the one which allows .@.

 Alternatively, you could just loosen the restrictions even further,
 and only ban input that doesn't contain an @ sign.  (Or that doesn't
 match ^...@]+@[...@]+\.[^@]+$, or whatever.)  Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.

Users often mis-type e-mail addresses.  It seems useful to be able to
trap as many typos as possible.  Many authors obviously believe this,
given how many employ JavaScript validators.  If HTML 5 were overly
permissive about input type=email then it's likely such authors would
continue to use homegrown JavaScript solutions, which slightly defeats
the purpose of HTML 5 introducing input type=email.

Smylers


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Aryeh Gregor
On Mon, Aug 24, 2009 at 4:36 AM, Smylers smyl...@stripey.com wrote:
 It's too complicated for most developers to roll their own validation,
 but there are standard libraries available which get it right.

Standard libraries available for all major languages?  As far as I can
tell from a quick search, the PHP standard library contains no e-mail
validation routines before 5.2.0 -- which isn't yet reliably available
except to the small minority of website admins with root access to
their machines.  Moreover, the e-mail validation in 5.2.0
(filter_var()) seems to be wrong -- apparently it just uses, yes, a
regex.  ("Don't use PHP" is, obviously, not a useful response here.)

If it were practical for everyone to validate strictly according to
spec on both client and server side, that would be fine.  I assume it
was felt there was good reason not to do this in HTML 5.

 Forms on websites capturing users' e-mail addresses typically want just
 the address part, prompting for the human-readable name in a separate
 box, so I think HTML 5's input type=email not allowing the above is
 helpful.

It might be more helpful if they stripped the part outside the angle
brackets, but I agree that it's reasonable to just reject these.

 There may actually be several categories of oddly placed dots.  While
 the address in the form you give above works it may be, say, that those
 with repeated dots in the hostname part don't work.

 On the specific case of a . immediately before the @, I've seen that
 before: this Perl library module extends an RFC-compliant module to
 allow just that; its author admits .@ breaks the RFCs but claims such
 breakage is useful in the real world, specifically when dealing with
 e-mail addresses for Japanese mobile phones:

  http://search.cpan.org/perldoc?Email::Valid::Loose

 That somebody has found this to be a sufficiently widespread problem
 with standard Perl e-mail address validation to write and upload a
 module which 'fixes' this (and just that; it makes no other changes)
 suggests that people will find HTML 5's input type=email to be
 problematic in precisely the same way.

The breakdown of the 202 is as follows.

* Single trailing dot in domain part: 100 (prohibited by RFC but
plausibly deliverable)
* Single trailing dot in local part: 40 (prohibited by RFC but
plausibly deliverable)
* Valid address in angle brackets (with other junk around it): 21
(permitted by RFC, kind of, and plausibly deliverable)
* Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)
* No @: 9 (unlikely to be deliverable)
* Comment: 3 (permitted by RFC and plausibly deliverable)
* Miscellaneous: 9 (one containing [...@[spam], two with trailing ,
one in quotes, one with single leading dot in local part, two with
single leading comma in local part, one with leading : , one with
leading \)
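
As a quick sanity check on the figures, the following Python sketch adds up the categories above and recomputes the percentages quoted in this thread. The totals 3,088,880 and 3,255 are the ones reported for the English Wikipedia sample; the variable names are hypothetical.

```python
# Category counts from the breakdown above.
breakdown = {
    "single trailing dot in domain part": 100,
    "single trailing dot in local part": 40,
    "valid address in angle brackets": 21,
    "multiple consecutive dots": 20,
    "no @": 9,
    "comment": 3,
    "miscellaneous": 9,
}

confirmed = 3_088_880      # confirmed addresses in the sample
failed_strict = 3_255      # failures before ignoring whitespace
failed_after_stripping = sum(breakdown.values())

pct_strict = 100 * failed_strict / confirmed
pct_stripped = 100 * failed_after_stripping / confirmed
reduction = failed_strict / failed_after_stripping

print(f"failures after stripping: {failed_after_stripping}")
print(f"strict failure rate:      {pct_strict:.3f}%")
print(f"rate after stripping:     {pct_stripped:.4f}%")
print(f"reduction factor:         {reduction:.1f}")
```

The categories sum to 202, and the ratio between the two failure counts is consistent with the "factor of more than ten" mentioned in the original message.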

Again, this excludes ~3000 that would be valid if [ \n\t] were
stripped.  Note that almost all of the hits seem like they probably
are real working e-mail addresses that did have mail successfully sent
to them (as opposed to a few that look like they were only confirmed
by a bug).
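
To make the whitespace point concrete, here is a small Python sketch that removes the characters in [ \n\t] and then applies the strict dot-atom-text @ dot-atom-text check. It is only an illustration under assumptions: the function names are hypothetical, the character class is atext from RFC 5322 (including "&"), and removing whitespace from the middle of a string could in principle turn one address into a different one.

```python
import re

ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"
STRICT = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")

# The same three characters the adjusted query tolerated.
WHITESPACE = re.compile(r"[ \n\t]+")


def strip_whitespace(value: str) -> str:
    """Remove spaces, tabs and newlines anywhere in the value."""
    return WHITESPACE.sub("", value)


def valid_after_stripping(value: str) -> bool:
    """Apply the strict check to the whitespace-free value."""
    return STRICT.fullmatch(strip_whitespace(value)) is not None


if __name__ == "__main__":
    for candidate in ["user@example.com ", "user @example.com",
                      "user.@example.com "]:
        print(repr(candidate), valid_after_stripping(candidate))
```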

 So it may actually be that there isn't a general problem here of lots of
 real-world e-mail addresses which work but don't comply with the RFCs;
 it may simply be the one case of .@?

No, that was just the example I chose because I knew that person
personally, and so was able to confirm that the address actually
worked.  I can't use my database access at Wikipedia to spam people
just to see if their addresses work, so I can't confirm any of the
others directly.

 Users often mis-type e-mail addresses.  It seems useful to be able to
 trap as many typos as possible.  Many authors obviously believe this,
 given how many employ JavaScript validators.  If HTML 5 were overly
 permissive about input type=email then it's likely such authors would
 continue to use homegrown JavaScript solutions, which slightly defeats
 the purpose of HTML 5 introducing input type=email.

I agree, but if the only purpose is to catch typos, it doesn't seem
correct to completely prohibit submission.  At most, you should warn
the user.  Of course, this would be potentially complicated to do.


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Tab Atkins Jr.
On Mon, Aug 24, 2009 at 8:54 AM, Aryeh Gregor simetrical+...@gmail.com wrote:
 The breakdown of the 202 is as follows.

 * Single trailing dot in domain part: 100 (prohibited by RFC but
 plausibly deliverable)

Do these still have a normal TLD identifier before the trailing dot?
Or are they just *really* weird?

I almost suspect that these are just simple typos that are cleaned up
by mailers, and could be flagged as invalid.

 * Single trailing dot in local part: 40 (prohibited by RFC but
 plausibly deliverable)

It seems that these are indeed valid in the wild, and so the algorithm
should be loosened to allow these.

 * Valid address in angle brackets (with other junk around it): 21
 (permitted by RFC, kind of, and plausibly deliverable)

I'm fine with flagging these.  The user can just remove the junk.

 * Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)

We need to see if these are actually deliverable.

 * No @: 9 (unlikely to be deliverable)

Flag them.

 * Comment: 3 (permitted by RFC and plausibly deliverable)

What do you mean by this?  Is it just fluff that doesn't affect the
actual routing of the mail?  If so, I'm fine with keeping them
flagged, even if it is allowed by RFC.

 * Miscellaneous: 9 (one containing [...@[spam], two with trailing ,
 one in quotes, one with single leading dot in local part, two with
 single leading comma in local part, one with leading : , one with
 leading \)

It would be nice to see how many of the latter 6 are deliverable.

~TJ


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Smylers
Aryeh Gregor writes:

 On Mon, Aug 24, 2009 at 4:36 AM, Smylers smyl...@stripey.com wrote:
 
  It's too complicated for most developers to roll their own
  validation, but there are standard libraries available which get it
  right.
 
 Standard libraries available for all major languages?

I'd be surprised if they weren't.

 As far as I can tell from a quick search, the PHP standard library
 contains no e-mail validation routines before 5.2.0

Sorry, I meant there is a library (meaning additional to the core
language) available in a standard place (wherever that language's
libraries are typically found); I wasn't intending to claim that the
standard library of functionality which is part of a language's core
distribution would include it.

For PHP I Googled "email validation Pear" and found the following as the
top hit.  I haven't tried it, but it claims to comply with RFC 822, and I'd
have more faith in it than the average home-rolled attempt:

  http://pear.php.net/package/Validate/

  Forms on websites capturing users' e-mail addresses typically want
  just the address part, prompting for the human-readable name in a
  separate box, so I think HTML 5's input type=email not allowing
  the above is helpful.
 
 It might be more helpful if they stripped the part outside the angle
 brackets, but I agree that it's reasonable to just reject these.

Good point.  And that's largely a UI matter: either way the web server
doesn't receive a value with the outside clutter in it.

 The breakdown of the 202 is as follows.

Thanks for providing this.

 * Single trailing dot in domain part: 100 (prohibited by RFC but
   plausibly deliverable)

Yup.  If it is deliverable then surely it's an alias to the same address
without the trailing dot, in which case a browser could choose to remove
it.

 * Single trailing dot in local part: 40 (prohibited by RFC but
   plausibly deliverable)

Discussed previously.  This seems to be the problematic category.

 * Valid address in angle brackets (with other junk around it): 21
 (permitted by RFC, kind of, and plausibly deliverable)

Discussed above.

 * Multiple consecutive dots: 20 (prohibited by RFC but plausibly
   deliverable)

If you mean the ..s are in the local part then yes, it sounds likely
that would get delivered, and a quick non-exhaustive trial seemed to
show this can work.

(If they're in the hostname then I'd be amazed if it's deliverable, but
surely it'd be to the same address that's reached by replacing sequences
of dots with a single dot.)

 * No @: 9 (unlikely to be deliverable)

Indeed.

 * Comment: 3 (permitted by RFC and plausibly deliverable)

Equivalent to the angle bracket case above -- the address without the
comment could be extracted.

 * Miscellaneous: 9 (one containing [...@[spam], two with trailing ,
   one in quotes, one with single leading dot in local part, two with
   single leading comma in local part, one with leading : , one with
   leading \)

They don't sound deliverable, or if they are would also be with
superfluous punctuation stripped.  And I'm not sure single cases are
worth fretting about.  If HTML 5 validation rejected one of the above it
seems very likely the user would be able to provide an alternative
address (or alternatively punctuated address) which is valid.

  So it may actually be that there isn't a general problem here of
  lots of real-world e-mail addresses which work but don't comply with
  the RFCs; it may simply be the one case of .@?
 
 No, that was just the example I chose because I knew that person
 personally, and so was able to confirm that the address actually
 worked.

There are two categories of input which could be a working e-mail
address yet violate the RFCs:

  1 A valid e-mail address with extra 'stuff' in it or surrounding it
(spaces, comments, trailing punctuation characters, etc).  As you
suggested, browsers can clean up the user's input, so what servers
receive is a valid e-mail address.  

  2 A working e-mail address which contains something the RFCs say it
shouldn't but needs that in order to function; attempting to clean
it up would transform it to a different e-mail address, which
possibly delivers somewhere differently from the original.

Analysis of your detailed breakdown suggests the only addresses in
category 2 are those with dots in odd places in the local part.

So it may be the only change required to allow all working real-world
e-mail addresses is a willful violation that permits dots anywhere in
the local part (even immediately after another . or before the @).

That change would appear to cover the cases in your data, but others may
have data which shows there are additional cases.
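
One way to picture the proposed relaxation is the Python sketch below: the local part becomes one or more of atext or ".", while the domain stays dot-atom-text. This is a hedged illustration of the idea in this message, not spec text. The names RELAXED and is_relaxed_valid are hypothetical, and the atext class is taken from RFC 5322 (including "&").

```python
import re

ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# Local part: dots allowed anywhere, including doubled or before the @.
RELAXED_LOCAL = rf"(?:{ATEXT}|\.)+"

# Domain part: still strict dot-atom-text.
RELAXED = re.compile(rf"{RELAXED_LOCAL}@{DOT_ATOM_TEXT}")


def is_relaxed_valid(value: str) -> bool:
    """Return True if the value matches the relaxed pattern in full."""
    return RELAXED.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["user.@example.com", "a..b@example.com",
                      "user@example..com", "@example.com"]:
        print(repr(candidate), is_relaxed_valid(candidate))
```

Note that this pattern also accepts a local part made of dots only; whether that is desirable is a separate question.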

Smylers


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Aryeh Gregor
On Mon, Aug 24, 2009 at 10:11 AM, Tab Atkins Jr. jackalm...@gmail.com wrote:
 Do these still have a normal TLD identifier before the trailing dot?
 Or are they just *really* weird?

None of the addresses had more than one thing wrong with it.  These
looked like perfectly normal addresses but with a trailing dot, like
f...@example.com..  I assume mailers just drop the trailing dot here.
example.com. is generally treated the same as example.com by
everything except the actual DNS protocol, AFAIK -- if you resolve
example.com the resolver will usually *append* the dot when it
actually makes the query.
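
A sketch of what "just drop the trailing dot" could look like, in Python. It assumes, as this message does, that a single trailing dot on the domain does not change where mail is delivered; that assumption may not hold in every context. The function name is hypothetical.

```python
def drop_trailing_domain_dot(address: str) -> str:
    """Remove one trailing dot from the domain part, if present.

    Values without an @, and domains ending in two or more dots,
    are returned unchanged.
    """
    local, sep, domain = address.rpartition("@")
    if sep and domain.endswith(".") and not domain.endswith(".."):
        domain = domain[:-1]
    return f"{local}{sep}{domain}"


if __name__ == "__main__":
    for candidate in ["user@example.com.", "user@example.com",
                      "user@example.com.."]:
        print(repr(candidate), "->", repr(drop_trailing_domain_dot(candidate)))
```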

 It seems that these are indeed valid in the wild, and so the algorithm
 should be loosened to allow these.

But the RFC forbids them.  If we're going to even allow things that
sort of work but which the RFC forbids, we may as well allow almost
anything, because who knows if it might work on some software?

 We need to see if these are actually deliverable.

I'd assume so.  In theory all of these should be deliverable.  The
ones without @ obviously aren't, but those all look to have been
confirmed back in 2006, so maybe there was a bug back then.  Addresses
with two or more consecutive dots have been confirmed as recently as
May 2009.

 What do you mean by this?  Is it just fluff that doesn't affect the
 actual routing of the mail?  If so, I'm fine with keeping them
 flagged, even if it is allowed by RFC.

I mean things like

bobsm...@example.com (use for new groups only)

If I'm reading the RFC correctly, the parenthesized part is a comment,
and is ignored (like whitespace).
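
For this simple shape, extracting the address could be sketched in Python as below. It handles only one non-nested parenthesized comment at the end of the value; comments as RFC 5322 defines them can nest and can appear in other positions, which this sketch deliberately does not attempt. The names are hypothetical.

```python
import re

# One trailing "( ... )" with no nested parentheses inside it.
TRAILING_COMMENT = re.compile(r"\s*\([^()]*\)\s*$")


def strip_trailing_comment(value: str) -> str:
    """Drop a single trailing parenthesized comment, if there is one."""
    return TRAILING_COMMENT.sub("", value).strip()


if __name__ == "__main__":
    print(strip_trailing_comment("user@example.com (use for new groups only)"))
```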

On Mon, Aug 24, 2009 at 10:42 AM, Smylers smyl...@stripey.com wrote:
 For PHP I Googled email validation Pear and found the following as the
 top hit.  I haven't tried it, but it claims to comply to RFC822, and I'd
 have more faith in it than the average home-rolled attempt:

  http://pear.php.net/package/Validate/

I stand corrected, assuming that's usable for people with only FTP
access.  (It looks like it is, at a glance, since it's seemingly pure
PHP.)  Given this, I'm not clear why there's a need to deviate from
the RFCs here.  I assume the burden on UA implementors wouldn't be all
that much.  Granted, many web developers seem not to be using these
validation libraries server-side, but I don't see how using different
standards for input type=email helps that.

 Yup.  If it is deliverable then surely it's an alias to the same address
 without the trailing dot, in which case a browser could choose to remove
 it.

Yes, it's not possible for example.com. to mean anything different
from example.com.  (In fact they do mean something different in DNS,
but example.com. means the same thing as what example.com is
normally used to mean.  Moreover, the meaning of example.com in DNS
is basically nonsense for web apps processing user-submitted e-mail
addresses.  At least, as far as I understand it; I don't know too much
about DNS.)

 Discussed previously.  This seems to be the problematic category.

I wouldn't rule out the existence of other problematic categories that
happen not to have cropped up on the English Wikipedia.

 If you mean the ..s are in the local part then yes, it sounds likely
 that would get delivered, and a quick non-exhaustive trial seemed to
 show this can work.

 (If they're in the hostname then I'd be amazed if it's deliverable, but
 surely it'd be to the same address that's reached by replacing sequences
 of dots to a single dot.)

Agreed.  Of course, they're all in the local part.

 They don't sound deliverable, or if they are would also be with
 superfluous punctuation stripped.  And I'm not sure single cases are
 worth fretting about.  If HTML 5 validation rejected one of the above it
 seems very likely the user would be able to provide an alternative
 address (or alternatively punctuated address) which is valid.

The one with a leading dot might be legitimate.  I'd imagine the
others are errors.

 There are two categories of input which could be a working e-mail
 address yet violate the RFCs:

  1 A valid e-mail address with extra 'stuff' in it or surrounding it
(spaces, comments, trailing punctuation characters, etc).  As you
suggested, browsers can clean up the user's input, so what servers
receive is a valid e-mail address.

  2 A working e-mail address which contains something the RFCs say it
shouldn't but needs that in order to function; attempting to clean
it up would transform it to a different e-mail address, which
possibly delivers somewhere differently from the original.

 Analysis of your detailed breakdown suggests the only addresses in
 category 2 are those with dots in odd places in the local part.

 So it may be the only change required to allow all working real-world
 e-mail addresses is a willful violation that permits dots anywhere in
 the local part (even immediately after another . or before the @).

 That change would appear to cover the cases in your data, but others may
 

Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Michelangelo De Simone
2009/8/24 Peter Kasting pkast...@google.com:

 I am mentoring a student who is writing a patch for this in WebKit as we
 speak -- we were just discussing the implementation yesterday and I believe
 he hopes to have it out for review tomorrow.

The mentored student has published the patch and is waiting for
comments.  This is the pattern I've used:
dotAtomText = [a-z0-9!#$%'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%'*+/=?^_`{|}~-]+)*

A value is valid if it matches dotAtomText@dotAtomText in its entirety.
Any feedback will be appreciated.

-- 
Bye,
Michelangelo


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread Brian Campbell

On Aug 24, 2009, at 3:24 PM, Aryeh Gregor wrote:

  Yup.  If it is deliverable then surely it's an alias to the same address
  without the trailing dot, in which case a browser could choose to remove
  it.

 Yes, it's not possible for example.com. to mean anything different
 from example.com.  (In fact they do mean something different in DNS,
 but example.com. means the same thing as what example.com is
 normally used to mean.  Moreover, the meaning of example.com in DNS
 is basically nonsense for web apps processing user-submitted e-mail
 addresses.  At least, as far as I understand it; I don't know too much
 about DNS.)


Actually, the trailing dot is meaningful. A domain without a trailing  
dot is a relative domain; for example, if you are within the  
example.com domain, then foo could resolve to  
foo.example.com (or if that doesn't exist, then it would try  
resolving that at the root level, and fail since foo is not a TLD).  
A domain with a trailing dot is an absolute domain; it will only ever  
be resolved at the root level.


This difference may be significant. If someone manages to register the  
top level domain mail (which may be possible if the proposed new  
gTLD rules are passed), and has an email address of f...@mail, then  
you might want to distinguish between that resolving to f...@mail.wikimedia.org 
 vs. f...@mail.


Of course, this is complicated because the trailing dot is technically  
not allowed in an email address, but it seems to work in some contexts  
that I've tried (though most just strip off the trailing dot).


About the more general subject of this thread, I have tested sending  
myself email at all of the following addresses, all of which seem to  
work just fine, though some generate warnings in my mail client (Apple  
mail):


Brian P. campb...@dartmouth.edu
...brian...p...campbell...@dartmouth.edu
brian.p.campb...@dartmouth.edu.
Brian (this is a test) P (of comments) Campbell (and whitespace)@(here  
comes the domain) dartmouth.edu

brian p campbell

Note that Dartmouth has a very permissive email system that allows  
name components to be delimited by whitespace and/or periods, and  
prefixes of name components as long as you wind up with a unique  
match. And of course the address without the domain only works when  
I'm sending within the same domain. In some cases, the addresses were
altered slightly in the process of being sent; for example,
'Brian P. campb...@dartmouth.edu' came through as
'Brian P. Campbell@dartmouth.edu'.


Given that there are so many technically invalid addresses that  
actually do work to deliver mail, and that I'm sure some people have  
odd addresses due to poor form validation (perhaps someone has signed  
up for an email account on a web form and it allowed spaces in the  
address), it's probably best to be relatively lenient about the  
addresses allowed. I think the best you can do is look for at least  
one character, followed by an @ sign, followed by a legal domain name  
(which seems to be more strictly checked, though given the presence of  
IDNs, may not be easy to restrict in the future as well).


-- Brian


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread TAMURA, Kent

FYI.
I was on the Gmail team and wrote the e-mail address validation code that
we currently use.
Gmail's validation rules are:
 - require @
 - local-part should be
   - quoted-string without CFWS and FWS, or
   - 1*(atext / .)
     This means dot-atom-text without the . restriction.
     This looseness was introduced for Japanese cell phone addresses.
 - domain-part should be [a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+
   It requires at least 1 dot.  The last non-dot sequence should have at
   least 2 characters.

I have never heard requests for support of non-ASCII characters other than
IDN.

--
TAMURA Kent
Software Engineer, Google





Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-24 Thread TAMURA, Kent

  - domain-part should be [a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+


Correction.  - is allowed for domain-part.
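
The rules described in these two messages can be approximated with the Python sketch below. It is a reading of the prose above, not Gmail's actual code: the names are hypothetical, atext is taken from RFC 5322 section 3.2.3, and the quoted-string alternative is an approximation of "quoted-string without CFWS and FWS" (printable ASCII qtext plus backslash-escaped characters).

```python
import re

ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# 1*(atext / "."): dots may appear anywhere in the local part.
LOCAL_LOOSE = rf"(?:{ATEXT}|\.)+"

# Approximation of a quoted-string with no folding whitespace:
# qtext (printable ASCII except space, '"' and '\') or a quoted-pair.
QUOTED = r'"(?:[\x21\x23-\x5b\x5d-\x7e]|\\[\t\x20-\x7e])*"'

# Labels of letters, digits and "-" (per the correction), at least one
# dot, and a final label of at least two characters.
DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z0-9-]{2,}"

GMAIL_STYLE = re.compile(rf"(?:{QUOTED}|{LOCAL_LOOSE})@{DOMAIN}")


def is_gmail_style_valid(value: str) -> bool:
    """Return True if the value satisfies the rules as sketched here."""
    return GMAIL_STYLE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["user.@example.com", "a..b@example.co.jp",
                      '"john.doe"@example.com', "user@localhost",
                      "user@example.c"]:
        print(repr(candidate), is_gmail_style_valid(candidate))
```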

--
TAMURA Kent
Software Engineer, Google





[whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
Section 4.10.4.1.5 defines a valid e-mail address as follows:

A valid e-mail address is a string that matches the production
dot-atom-text @ dot-atom-text where dot-atom-text is defined in RFC
5322 section 3.2.3. [RFC5322]

This is much more restrictive than the full range of e-mail addresses
allowed by RFC 5322 et al.  I've been considering whether to use
input type=email in MediaWiki, and whether to change our server-side
e-mail address validation to match.  Historically, MediaWiki has
mostly just required that an @ symbol be present in the address.
Originally we used a simplistic regex, but when users complained, we
looked into the RFCs and decided it was too complicated to bother with
validation beyond checking for an @ sign.

So before switching us over, I decided to do some research on how many
users' addresses would be invalidated.  I used the database for the
English Wikipedia.  Over all registered users, I found 3,088,880
confirmed addresses, not necessarily all distinct.  (Confirmed here
means that in theory, modulo bugs, the user followed a confirmation
link in the e-mail they received, so the address probably works in
practice.)  Of those, 3,255 (~0.1%) failed HTML 5 validation, as
determined using the following regex-based database query:

r...@rosemary:enwiki SELECT COUNT(*) FROM user WHERE
user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
'^[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*$'
AND user_email != '';
+----------+
| COUNT(*) |
+----------+
|     3255 |
+----------+
1 row in set (16 min 10.80 sec)

(Someone please tell me if my regex doesn't match HTML 5 here.)
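
For anyone who wants to try the same check outside MySQL, here is a
minimal Python sketch of the dot-atom-text @ dot-atom-text production.
The character class follows the atext definition in RFC 5322 section
3.2.3 rather than the query as archived here, so it may differ from the
query in detail.  The names ATEXT, DOT_ATOM_TEXT, and is_html5_email are
illustrative assumptions, not anything from the spec or from MediaWiki.

```python
import re

# atext per RFC 5322 section 3.2.3: letters, digits, and these symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text: one or more atext runs separated by single dots.
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# The production under discussion: dot-atom-text "@" dot-atom-text.
HTML5_EMAIL = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")


def is_html5_email(address: str) -> bool:
    """Return True if the whole string matches dot-atom-text@dot-atom-text."""
    return HTML5_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ("user+tag@example.com", "user@example.com ", "a..b@example.com"):
        print(repr(candidate), is_html5_email(candidate))
```

Note that fullmatch is used so that leading or trailing whitespace causes
a failure, which is the behaviour the query is measuring.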

Inspection showed that the overwhelming majority of the failures were
due to the presence of excess whitespace, often a single trailing
space, or a space inserted before or after the @ sign.  When I
adjusted the regex to ignore those failures, I got a smaller list, 202
(about 0.007% of the total):

r...@rosemary:enwiki SELECT CONCAT('', user_email, ''),
user_email_authenticated FROM user WHERE user_email_authenticated IS
NOT NULL AND user_email NOT REGEXP '^[
\t\n]*[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~
\t\n]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~
\n\t]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~
\n\t]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~ \n\t]+)*[ \n\t]*$' AND
user_email NOT REGEXP
'^[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*$'
AND user_email != '' LIMIT 500;
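+----------------------------+--------------------------+
| CONCAT('', user_email, '') | user_email_authenticated |
+----------------------------+--------------------------+
...snip...
+----------------------------+--------------------------+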
202 rows in set (13 min 1.71 sec)
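
The percentages quoted above can be rechecked from the three counts given
in this message.  The following is only a small arithmetic sketch; the
constant and function names are made up for the example.

```python
# Counts taken from the message: confirmed addresses, strict failures,
# and failures that remain after whitespace is tolerated.
TOTAL_CONFIRMED = 3_088_880
FAILED_STRICT = 3_255
FAILED_AFTER_WHITESPACE = 202


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Failures explained by excess whitespace alone.
WHITESPACE_ONLY = FAILED_STRICT - FAILED_AFTER_WHITESPACE

if __name__ == "__main__":
    print(f"strict failures:      {percent(FAILED_STRICT, TOTAL_CONFIRMED):.3f}%")
    print(f"after whitespace fix: {percent(FAILED_AFTER_WHITESPACE, TOTAL_CONFIRMED):.4f}%")
    print(f"whitespace-only share of failures: "
          f"{percent(WHITESPACE_ONLY, FAILED_STRICT):.1f}%")
```

The whitespace-only share works out to roughly 94% of the strict
failures, consistent with "overwhelming majority".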

Some of these were clearly wrong, and shouldn't have been confirmed to
begin with.  Some didn't even have an @ sign, so probably were
submitted in some window when we did no validation at all (and I have
no idea how they got confirmed).  Of the ones that possibly work, I
identified two major categories:

1) Addresses in the form foo <b...@baz.example>, or similar.  These
mostly match RFC 5322's name-addr production instead of addr-spec
(some have trailing semicolons, or are missing the initial <, etc.).
I assume these were copy-pasted from a mail application.

2) Addresses with dots in incorrect places, in either the local part
or the domain name part.  For instance, multiple consecutive dots, or
leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
I asked one of the users with an invalid address of the form
f...@example.com, and he said it worked fine for him.  GNU mail gave
a syntax error when I tried to send mail to that address, but Gmail
sent it without complaint, and the user received it successfully.
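
The two categories above could be separated mechanically with something
like the following Python sketch.  It is a rough heuristic, not the
method used for the manual inspection described here: name-addr style
input is guessed from the presence of angle brackets, and the category
labels and the function name classify_failure are assumptions for the
example.

```python
def classify_failure(address: str) -> str:
    """Roughly bucket an address that failed dot-atom-text@dot-atom-text."""
    # Category 1: looks like a display name plus angle-bracketed address.
    if "<" in address or ">" in address:
        return "name-addr-like"

    # Split on the last "@"; addresses with no "@" get their own bucket.
    local, at_sign, domain = address.rpartition("@")
    if not at_sign:
        return "no-at-sign"

    # Category 2: leading, trailing, or consecutive dots in either part.
    for part in (local, domain):
        if part.startswith(".") or part.endswith(".") or ".." in part:
            return "misplaced-dot"

    return "other"


if __name__ == "__main__":
    for candidate in ("foo <bar@baz.example>", "foo.@example.com", "nobody"):
        print(repr(candidate), classify_failure(candidate))
```

Addresses with trailing semicolons or a missing opening bracket would
need extra rules; this sketch leaves them in whichever bucket they
happen to fall.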

There were other types of addresses that didn't meet HTML 5's
specification after whitespace was stripped, but none with more than a
single-digit number of addresses occurring in the sample of three
million or so that I looked at.  It's notable that not a single one
used the quoted-string or domain-literal productions, as far as I
could tell from manual inspection.

I should also note that this was only the English Wikipedia, and it
might be that speakers of other languages are more prone to use other
types of addresses that don't meet HTML 5's specification.  When
looking at the Swedish and German databases, for instance, I found one
or two addresses that had apparently been confirmed but contained
non-ASCII characters.  I didn't know the users with those addresses,
and I didn't want to send them unsolicited mail, so I wasn't able to
establish whether those addresses actually worked or the confirmation
was bogus.


Conclusions: At a minimum, I suggest that HTML 5 

Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread David Gerard
2009/8/23 Aryeh Gregor simetrical+...@gmail.com:

   Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.


+1

The quoted portion above strikes to the heart of the matter. I suppose
the spec wants to obviate defective email validation JavaScript, but
any restriction will (a) break stuff the user thinks should work (b)
not stop bad web coders for a second.



- d.


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
On Sun, Aug 23, 2009 at 3:41 PM, Aryeh Gregor simetrical+...@gmail.com wrote:
 Alternatively, you could just loosen the restrictions even further,
 and only ban input that doesn't contain an @ sign.  (Or that doesn't
 match ^...@]+@[...@]+\.[^@]+$, or whatever.)  Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.

. . . and I should add that I think it might be useful to have a note
recommending that application authors not do any validation beyond
what the spec ends up mandating as required (preferably almost
nothing).  I've had a lot of problems with sites that think + isn't
valid in e-mail addresses, including pretty major sites that should
know better.  You don't really know if it will work anyway until you
try actually sending mail to it -- maybe the local part was mistyped
or invented -- so why not just do that?
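
To make the "almost nothing" option concrete, here is a minimal Python
sketch of the two loosest checks mentioned in the quoted text: requiring
only an @ sign, or requiring something, "@", something, ".", something.
The exact pattern is an assumed reading of the quoted suggestion, and the
function names are hypothetical.  Neither check says anything about
whether mail will actually be delivered.

```python
import re

# Assumed reading of the loose pattern: non-"@" text, one "@", non-"@"
# text, a dot, and more non-"@" text.
LOOSE_EMAIL = re.compile(r"[^@]+@[^@]+\.[^@]+")


def contains_at_sign(address: str) -> bool:
    """The loosest check: the input merely contains an "@"."""
    return "@" in address


def passes_loose_check(address: str) -> bool:
    """Slightly stricter: exactly one "@" and a dot somewhere after it."""
    return LOOSE_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ("user+tag@example.com", "foo..bar@example.com", "a@b"):
        print(candidate, contains_at_sign(candidate), passes_loose_check(candidate))
```

Both checks accept "+" in the local part and consecutive dots, which the
stricter patterns discussed in this thread would handle differently.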


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Tab Atkins Jr.
Thanks for doing this work, Aryeh!  It's really awesome!

On Sun, Aug 23, 2009 at 2:41 PM, Aryeh Gregor simetrical+...@gmail.com wrote:
 Beyond that, although it's safe to say that quoted-string or
 domain-literal or even entirely invalid addresses are extraordinarily
 rare, there are *some* real people who do use them.  Unless something
 is so completely invalid that it's obviously impossible that any mail
 server would even try to send it anywhere, you're probably going to be
 cutting out some small number of users.

Unless you avoid validating *entirely*, there's virtually always going
to be some subset of theoretically valid addresses that you'll flag as
invalid, though.

 So why not have the spec say that in the case of e-mail addresses, the
 browser may warn the user, but should permit them to submit the
 address anyway?  If the user is willing to override the warning, then
 it's likely that they personally know that the e-mail address works,
 e.g., because they use it.

I'd be okay with this.

 Alternatively, you could just loosen the restrictions even further,
 and only ban input that doesn't contain an @ sign.  (Or that doesn't
 match ^...@]+@[...@]+\.[^@]+$, or whatever.)  Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.

Unlike type=tel, emails have a relatively simple format which *very
nearly everyone* uses.  I agree that if an email works but is one of
those crazy formats it's probably not a good idea to bar them from
using it, but in practice that's exactly what happens right now with
email validation scripts.  If type=email doesn't validate at all
people will still just continue to use their broken homebrew
validators both on client-side and server-side.

It's possible that a token validation step would be sufficient, but I
suspect not.  Probably just a slight loosening of the allowed format,
informed by actual data such as what you gathered, would work fine,
possibly augmented by your suggestion of making type=email flag
'invalid' addresses but not actually prevent them from being
submitted.

Would you mind sharing these 200 or so that don't validate?  Obviously
there are privacy concerns, but I think it would be sufficient to just
replace every alpha character with 'x' and every numeric with '0', or
some similar information-removing transformation.  None of them fail
validation because of the letters or numbers used, so that would still
give us the information we need without revealing stuff we don't.
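
The transformation suggested above is simple enough to sketch in a few
lines of Python.  This is only an illustration: the function name
anonymize is made up, and treating non-ASCII letters and digits the same
way as ASCII ones is an assumption on top of what the message says.

```python
def anonymize(address: str) -> str:
    """Replace every letter with 'x' and every digit with '0'.

    Punctuation and whitespace are kept, since those are what make an
    address fail validation.
    """
    result = []
    for ch in address:
        if ch.isalpha():
            result.append("x")
        elif ch.isdigit():
            result.append("0")
        else:
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    print(anonymize("John.Doe99@example.com"))
    print(repr(anonymize(" foo..bar@baz.org ")))
```

The structure of each address (dots, spaces, brackets, @ signs) survives,
while the identifying letters and numbers do not.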

~TJ


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
On Sun, Aug 23, 2009 at 4:00 PM, Tab Atkins Jr. jackalm...@gmail.com wrote:
 Unless you avoid validating *entirely*, there's virtually always going
 to be some subset of theoretically valid addresses that you'll flag as
 invalid, though.

There shouldn't be, IMO, if the browser is forbidden to submit them.

 Unlike type=tel, emails have a relatively simple format which *very
 nearly everyone* uses.  I agree that if an email works but is one of
 those crazy formats it's probably not a good idea to bar them from
 using it, but in practice that's exactly what happens right now with
 email validation scripts.  If type=email doesn't validate at all
 people will still just continue to use their broken homebrew
 validators both on client-side and server-side.

They'll probably do that anyway.  HTML 5 doesn't have to mandate it.

 Would you mind sharing these 200 or so that don't validate?  Obviously
 there are privacy concerns, but I think it would be sufficient to just
 replace every alpha character with 'x' and every numeric with '0', or
 some similar information-removing transformation.  None of them fail
 validation because of the letters or numbers used, so that would still
 give us the information we need without revealing stuff we don't.

I doubt it would be useful.  I summarized all the interesting points,
and remember that these are only 0.007% of the total.  Also, note that
it was 3255 of them that didn't validate.  It was 202 that didn't
validate even after the regex was adjusted to allow whitespace
everywhere (should be equivalent to stripping 0x9, 0xA, 0x20 from
email input).
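
A minimal Python sketch of the stripping approach just described: remove
every 0x9, 0xA, and 0x20 character, then apply the strict check.  The
character class is the RFC 5322 atext set, and the names STRICT_EMAIL,
strip_whitespace, and valid_after_stripping are assumptions for the
example; this is not the query that was actually run.

```python
import re

# atext per RFC 5322 section 3.2.3, and the strict production built on it.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"
STRICT_EMAIL = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")

# Translation table that deletes tab (0x9), newline (0xA), and space (0x20).
DELETE_WHITESPACE = str.maketrans("", "", "\t\n ")


def strip_whitespace(address: str) -> str:
    """Remove tabs, newlines, and spaces anywhere in the input."""
    return address.translate(DELETE_WHITESPACE)


def valid_after_stripping(address: str) -> bool:
    """Apply the strict check to the whitespace-stripped input."""
    return STRICT_EMAIL.fullmatch(strip_whitespace(address)) is not None


if __name__ == "__main__":
    for candidate in ("user @example.com ", "foo..bar@example.com"):
        print(repr(candidate), valid_after_stripping(candidate))
```

Inputs that fail only because of stray whitespace pass this check, while
addresses with misplaced dots still fail.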