Steve Atkins wrote:
>
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
>
> > Bruce Momjian <[email protected]> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> >> all valid characters:
> >
> >> http://en.wikipedia.org/wiki/E-mail_address
> >
> >> * Uppercase and lowercase English letters (a-z, A-Z)
> >> * Digits 0 to 9
> >> * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >> * Character . (dot, period, full stop) provided that it is not the
> >> first or last character, and provided also that it does not appear two
> >> or more times consecutively.
> >
> > That's an awful lot of special characters. For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address. I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> >
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
>
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
>
> I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without being
> too prone to grab things that just look vaguely like email addresses.
> Obviously there are some things it'll match that aren't email addresses,
> and some email addresses it won't match, but for indexing it's been really
> pretty good when combined with a good regex for domain parts[1].
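
For anyone who wants to try the local-part pattern quoted above on their own text, here is a minimal Python sketch. It is only an illustration under assumptions: the capture group is written with both parentheses, the doubled backslash is read as a string-escaped \s, case-insensitive matching is assumed, and the function name and sample text are made up. The domain-part regex referenced as [1] is not in this message, so the sketch stops at the "@".

```python
import re

# Local-part pattern from the quoted message, written with a balanced
# capture group. IGNORECASE is an assumption; the original flags are
# not shown in the thread.
LOCAL_PART = re.compile(r'([a-z0-9_][^<"@\s]{0,80})@', re.IGNORECASE)


def find_local_parts(text):
    """Return candidate email local parts found in free text.

    find_local_parts is a hypothetical helper name used only for
    this illustration.
    """
    return [m.group(1) for m in LOCAL_PART.finditer(text)]


# Made-up sample text with one "+" address and one bracketed address.
sample = 'Mail alice+lists@example.com or "Bob" <bob_smith@example.org> today.'
print(find_local_parts(sample))
```

Note that this only finds things that look like local parts; as the quoted text says, it is meant for indexing, not for validating addresses.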

OK, based on your experience, I think we have gone far enough by
allowing underscores. I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate. Look what happens when a dash follows an underscore:

	test=> select ts_parse('default', ' [email protected] ' );
	      ts_parse
	---------------------
	 (12," ")
	 (4,[email protected])
	 (12," ")
	(3 rows)

	test=> select ts_parse('default', ' [email protected] ' );
	    ts_parse
	-----------------
	 (12," ")
	 (16,a-b)
	 (11,a)
	 (12,-)
	 (11,b)
	 (12,-_)
	 (4,[email protected])
	 (12," ")
	(8 rows)

--
  Bruce Momjian <[email protected]> http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do

Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml 14 Aug 2009 14:53:20 -0000 1.53
--- doc/src/sgml/textsearch.sgml 13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
languages, token types <literal>word</> and <literal>asciiword</>
should be treated alike.
</para>
+
+ <para>
+ <literal>email</> does not support all valid email characters as
+ defined by RFC 5322. Specifically, the only non-alphanumeric
+ characters supported for email user names are period, dash, and
+ underscore.
+ </para>
</note>
<para>