On Tuesday, September 13, 2016 at 8:19:03 AM UTC-7, Ryan Sleevi wrote:
> On Tuesday, September 13, 2016 at 7:56:20 AM UTC-7, Peter Bowen wrote:
> > I would be careful reading too much into server names.
> > mail.[example.com] might host web based email access.  For example,
> > I'm typing this into a site called mail.google.com :)
> 
> Apologies that the conjunctive and was not clearer, and that it seemed more 
> enumerative. My point was that some certificates demonstrate patterns - such 
> as *both* names - that offer reasonable signals of use.
> 
> I agree that any heuristic approach leaves me profoundly uncomfortable as a 
> policy, but I would also suggest that some patterns in the certs are signals 
> that perhaps the impact to users, however great, may be overestimated.
> 
> Of course, all of this is based on the data we have - I agree, that if 
> StartCom were to log its 2015/2016 certs, we'd be in a much better place to 
> evaluate viability of minimizing user impact, if such a thing is at all 
> possible.

To explore the options further, I've been looking at non-public sources to see 
what alternatives exist.

One example set was looking at the hosts visited by GoogleBot over a 60 day 
period and seeing if any of the certificates seen for a host matched the 
certificates logged in CT.

That is, imagine the key as being constructed from [hash of cert] + [hostname 
from SAN] for certificates from CT, and in cases of GoogleBot crawls, [hash of 
cert] + [hostname from link] and [hash of cert] + [*.hostname minus a label]. 
For example, if GoogleBot crawled "www.google.com", it would emit keys for both 
"*.google.com" and "www.google.com" (to allow it to match with a cert for 
either name, since browsers will accept either name).
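
As a rough illustration of that key construction (a sketch only, not the actual 
pipeline; the function names and the choice of SHA-256 over the DER bytes as 
the "hash of cert" are my assumptions here):

```python
import hashlib


def cert_hash(cert_der: bytes) -> str:
    # Assumption: "hash of cert" is SHA-256 over the DER-encoded certificate.
    return hashlib.sha256(cert_der).hexdigest()


def ct_keys(cert_der: bytes, san_hostnames):
    """Keys for a certificate found in CT: one per SAN hostname."""
    fp = cert_hash(cert_der)
    return {(fp, name.lower()) for name in san_hostnames}


def crawl_keys(cert_der: bytes, hostname: str):
    """Keys for a crawled host: the exact name plus the wildcard form
    obtained by replacing the leftmost label with '*'."""
    fp = cert_hash(cert_der)
    host = hostname.lower().rstrip(".")
    keys = {(fp, host)}
    labels = host.split(".")
    if len(labels) > 1:
        # "www.google.com" -> "*.google.com"
        keys.add((fp, "*." + ".".join(labels[1:])))
    return keys


def seen_in_use(ct_key_set, crawl_key_set):
    """Keys present in both sets: logged certs observed on a crawled host."""
    return ct_key_set & crawl_key_set
```

A logged cert for "*.google.com" would then match a crawl of "www.google.com" 
through the wildcard key, and a cert for "www.google.com" through the exact key.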

While I'm unfortunately unable to share the specific results, even in buckets, 
they do suggest that if one were to check whether the hosts named in these 
certificates actually serve these certificates and are publicly accessible, and 
further intersect the result with the Alexa Top 1M, any whitelisting strategy 
(by host, by domain, or by certificate) could fit in under 50K, with some 
strategies going below 10K. The reasoning is that a number of hosts named in 
these certificates don't serve them, and instead use a certificate from some 
other CA. A number have switched, for example, to Let's Encrypt, obviating the 
need for whitelisting.

Unfortunately, that's not easily publicly reproducible, which I think is an 
important aspect for consideration here.

So let's again revisit the combined set of WoSign & StartCom certs (which 
necessarily includes everything GoogleBot has ever seen, but not necessarily 
any undisclosed and undetected StartCom certs).

We know there are 5,769 unique certificate hashes with wildcards in the Alexa 
Top 1M, over 2,710 distinct eTLD+1s. There are 61,109 certs that contain 
non-wildcard hosts, over 18,650 distinct eTLD+1s.

Another possibility to explore, then, is to attempt to connect to each of 
these hosts and see which certificate they present, since we can't use hosts 
mined by Google's crawler (oh how I wish we could). If a host presents one of 
these certificates, its eTLD+1 could be whitelisted, along with the eTLD+1s of 
all wildcard certificates, on the generous assumption that every wildcard host 
is using its certificate (I believe there's sufficient evidence this isn't the 
case, but sure).
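
A minimal sketch of such a probe, using only the Python standard library. The 
function names, the input shape (hosts grouped by eTLD+1, which would need the 
Public Suffix List to compute), and SHA-256 as the certificate hash are all 
assumptions for illustration, not a description of any existing tool:

```python
import hashlib
import socket
import ssl


def fetch_fingerprint(host, port=443, timeout=10.0):
    """Connect with SNI and return the SHA-256 of the presented leaf cert.
    Validation is disabled on purpose: we only want to see what is served."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    if der is None:
        raise ValueError("no certificate presented")
    return hashlib.sha256(der).hexdigest()


def etld1s_to_whitelist(hosts_by_etld1, known_fingerprints,
                        fetch=fetch_fingerprint):
    """Return the eTLD+1s with at least one host serving a known cert.

    hosts_by_etld1: dict mapping eTLD+1 -> iterable of non-wildcard hosts.
    known_fingerprints: set of hex SHA-256 hashes of the logged certs.
    """
    keep = set()
    for etld1, hosts in hosts_by_etld1.items():
        for host in hosts:
            try:
                fp = fetch(host)
            except (OSError, ValueError):
                # Unreachable, not speaking TLS, etc.: treat as not in use.
                continue
            if fp in known_fingerprints:
                keep.add(etld1)
                break
    return keep
```

The eTLD+1s of wildcard certs would be added to the result unconditionally 
under the generous assumption above, since there is no single host to probe.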

This may help reduce the overall 18,763 distinct eTLD+1s into an even more 
compressible set, albeit at the cost of potentially excluding some certificates 
that were (undetectably) in use.
_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy
