On 01/09/2011 01:09 AM, Ashley Sheridan wrote:
On Sat, 2011-01-08 at 16:55 +0800, WalkinRaven wrote:


Regular expression to match the domain name format according to RFC 1034:

    [a-z]                 |
    [a-z] (?:[a-z]|[0-9]) |
    [a-z] (?:[a-z]|[0-9]|\-){1,61} (?:[a-z]|[0-9])                      ) # One 

(?:\.(?1))*+        # More labels
\.?                 # Root domain name

This rule matches only <label> and <label>. but not <label>.<label>...

I don't know what's wrong with it.

Thank you.

I think trying to do all of this in one regex will prove more trouble
than it's worth. Maybe breaking it down into something like this:

$domain = "www.ashleysheridan.co.uk";
$valid = false;

$tlds = array('aero', 'asia', 'biz', 'cat', 'com', 'coop', 'edu', 'gov',
'info', 'int', 'jobs', 'mil', 'mobi', 'museum', 'name', 'net', 'org',
'pro', 'tel', 'travel', 'xxx', 'ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al',
'am', 'an', 'ao', 'aq', 'ar', 'as', 'at', 'au', 'aw', 'ax', 'az', 'ba',
'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bm', 'bn', 'bo', 'br',
'bs', 'bt', 'bv', 'bw', 'by', 'bz', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch',
'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr', 'cu', 'cv', 'cx', 'cy', 'cz',
'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'ee', 'eg', 'er', 'es', 'et',
'eu', 'fi', 'fj', 'fk', 'fm', 'fo', 'fr', 'ga', 'gb', 'gd', 'ge', 'gf',
'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu',
'gw', 'gy', 'hk', 'hm', 'hn', 'hr', 'ht', 'hu', 'id', 'ie', 'il', 'im',
'in', 'io', 'iq', 'ir', 'is', 'it', 'je', 'jm', 'jo', 'jp', 'ke', 'kg',
'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc',
'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv', 'ly', 'ma', 'mc', 'md', 'me',
'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt',
'mu', 'mv', 'mw', 'mx', 'my', 'mz', 'na', 'nc', 'ne', 'nf', 'ng', 'ni',
'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'pa', 'pe', 'pf', 'pg', 'ph',
'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt', 'pw', 'py', 'qa', 're', 'ro',
'rs', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', 'se', 'sg', 'sh', 'si', 'sj',
'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'su', 'sv', 'sy', 'sz', 'tc',
'td', 'tf', 'tg', 'th', 'tj', 'tk', 'tl', 'tm', 'tn', 'to', 'tp', 'tr',
'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'us', 'uy', 'uz', 'va', 'vc',
've', 'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye', 'yt', 'za', 'zm',
'zw', );

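if(strlen($domain) <= 253) {
        $labels = explode('.', $domain);
        // the last label must be a known TLD
        if(in_array($labels[count($labels)-1], $tlds)) {
                for($i=0; $i<count($labels) -1; $i++) {
                        // a label fails if it is too long, breaks the LDH
                        // rule, or consists only of digits
                        if(strlen($labels[$i]) > 63
                           || !preg_match('/^[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?$/', $labels[$i])
                           || preg_match('/^[0-9]+$/', $labels[$i])) {
                                $valid = false;
                                break;  // no point continuing if one label is invalid
                        }
                        $valid = true;
                }
        }
}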


This checks the last label against the list of TLDs, and each of the
other labels against the standard a-z, 0-9 and hyphen rule given as the
preferred characters for a label (the LDH rule), where the first and
last characters of a label can't be a hyphen (oddly enough it doesn't
mention starting with a digit!)

Also, each label is checked to ensure it doesn't run over 63 characters,
and the whole thing isn't over 253 characters. Lastly, each label is
checked to ensure it doesn't completely consist of digits.

I've tested it only with my domain so far, but it should work fairly
well. As I said before, I couldn't think of a way to do it all with one
regex. It could probably be done, but would you really want to create a
huge and difficult to read/understand expression just because it's


Thank you for replying, Ash.

I know it may be better to pre-process the name with something like explode(), which would leave a less complex regular expression. But I just want to know what the problem is in my regular expression.

As for the code you've offered, I don't like the idea of a limited set of suffixes, since it may need updating from time to time. I just want to do format validation, not content validation.
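For format-only validation, the per-label checks described in the reply can be kept while dropping the TLD list. The following is a minimal sketch under assumptions: the function name is_valid_domain_format() is hypothetical, a single trailing dot is accepted as the root, and the all-digits rejection simply mirrors the rule described in the reply (drop it if numeric labels should be allowed).

```php
<?php
// Hypothetical helper (not from the thread): checks the *format* of a
// domain name only, with no list of known TLDs.
function is_valid_domain_format($domain)
{
    // Assumption: one trailing dot (the root) is allowed and ignored.
    if (substr($domain, -1) === '.') {
        $domain = substr($domain, 0, -1);
    }
    // Whole name must be non-empty and at most 253 characters.
    if ($domain === '' || $domain === false || strlen($domain) > 253) {
        return false;
    }
    foreach (explode('.', $domain) as $label) {
        // 1 to 63 characters, letters/digits/hyphens only,
        // and no hyphen at the start or end of the label.
        if (!preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/Di', $label)) {
            return false;
        }
        // Mirrors the reply's rule: a label may not be all digits.
        if (preg_match('/^[0-9]+$/D', $label)) {
            return false;
        }
    }
    return true;
}

var_dump(is_valid_domain_format('www.ashleysheridan.co.uk'));
var_dump(is_valid_domain_format('-bad.example'));
```

Because the name is split on dots first, each label is matched by a short pattern with no recursion, which keeps the length limits and the hyphen rule easy to read.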

As for the regular expression itself, yes, it is complex, but I've checked it several times very carefully, letter by letter, and I just don't understand what's wrong with it. Or is there some bug in the PCRE engine?
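One likely explanation is the order of the alternatives in group 1, rather than a bug in PCRE. This assumes the part of the pattern lost from the message was just the opening parenthesis of group 1 plus the delimiters, anchors and the x modifier. The group tries the single-letter alternative first. Inside (?:\.(?1))*+ that first success is kept: the possessive quantifier does not allow backtracking into the group once it has finished, and PCRE 8.x additionally treats a subroutine call such as (?1) as atomic. Every label after the first is therefore matched as one letter only. The sketch below compares the reconstructed pattern with a variant ($reordered, a name chosen here for illustration) in which the longest form of a label is tried first.

```php
<?php
// Assumed reconstruction of the posted pattern: the opening "(" of
// group 1, the delimiters, the anchors and the x modifier are guesses.
$original = '/^
(
    [a-z]                 |
    [a-z] (?:[a-z]|[0-9]) |
    [a-z] (?:[a-z]|[0-9]|\-){1,61} (?:[a-z]|[0-9])
)                   # One label, shortest alternative first
(?:\.(?1))*+        # More labels
\.?                 # Root domain name
$/x';

// Variant: one alternative whose optional tail is greedy, so the
// longest form of a label is tried before the one-letter form.
$reordered = '/^
(
    [a-z] (?: [a-z0-9-]{0,61} [a-z0-9] )?   # One label, longest form first
)
(?:\.(?1))*+        # More labels
\.?                 # Root domain name
$/x';

foreach (array('example', 'example.', 'a.b', 'www.example.com') as $name) {
    printf("%-16s original=%d reordered=%d\n", $name,
        preg_match($original, $name), preg_match($reordered, $name));
}
```

With the reconstructed pattern this should report a match for the first three names but not for www.example.com, whose later labels are longer than one letter. The reordered pattern should match all four.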

