On Tuesday, September 13, 2016 at 8:19:03 AM UTC-7, Ryan Sleevi wrote:
> On Tuesday, September 13, 2016 at 7:56:20 AM UTC-7, Peter Bowen wrote:
> > I would be careful reading too much into server names.
> > mail.[example.com] might host web based email access. For example,
> > I'm typing this into a site called mail.google.com :)
> Apologies that the conjunctive "and" was not clearer, and that it seemed more
> enumerative. My point was that some certificates demonstrate patterns - such
> as having *both* names - that offer reasonable signals of use.
> I agree that any heuristic approach leaves me profoundly uncomfortable as a
> policy, but I would also suggest that some patterns in the certs are signals
> that perhaps the impact to users, however great, may be overestimated.
> Of course, all of this is based on the data we have - I agree, that if
> StartCom were to log its 2015/2016 certs, we'd be in a much better place to
> evaluate viability of minimizing user impact, if such a thing is at all
For the sake of further exploring options, I've been looking at non-public
sources to see what alternatives exist.
One example was looking at the hosts visited by GoogleBot over a 60-day
period and checking whether any certificate seen for a host matched a
certificate logged in CT.
That is, imagine the key as being constructed from [hash of cert] + [hostname
from SAN] for certificates from CT, and, in the case of GoogleBot crawls, both
[hash of cert] + [hostname from link] and [hash of cert] + [*.hostname minus a
label]. That is, if GoogleBot crawled "www.google.com", it would emit keys for
both "*.google.com" and "www.google.com" (to allow it to match a cert for
either name, since browsers will accept either name).
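As a rough sketch of that keying scheme (function names are hypothetical, and SHA-256 over the DER encoding is an assumed choice of cert hash), the two sides might emit keys like:

```python
import hashlib


def cert_hash(der_bytes: bytes) -> str:
    # Assumed: SHA-256 over the DER-encoded certificate.
    return hashlib.sha256(der_bytes).hexdigest()


def ct_keys(der_bytes: bytes, san_names) -> set:
    # For a CT-logged cert: one key per SAN entry.
    h = cert_hash(der_bytes)
    return {(h, name) for name in san_names}


def crawl_keys(der_bytes: bytes, hostname: str) -> set:
    # For a crawled host: keys for both the exact hostname and the
    # wildcard form (leftmost label replaced by "*"), since browsers
    # accept a cert naming either.
    h = cert_hash(der_bytes)
    keys = {(h, hostname)}
    labels = hostname.split(".")
    if len(labels) > 2:  # avoid emitting a bare "*.com"-style wildcard
        keys.add((h, "*." + ".".join(labels[1:])))
    return keys
```

A match is then just a non-empty intersection between `crawl_keys(...)` for an observed host and the CT-derived key set for the same certificate.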
While, unfortunately, I'm unable to share the specific results, even in
buckets, they do suggest that if one were to examine the hosts named in these
certificates, determine whether those hosts actually use these certificates or
are publicly accessible, and further intersect with the Alexa Top 1M, any
whitelisting strategy (by host, by domain, or by certificate) could fit in
under 50K entries, with some strategies going below 10K. The reasoning for
this is that a number of hosts named in a certificate don't use that
certificate, and instead use one from some other CA. A number have switched,
for example, to Let's Encrypt, obviating the need for whitelisting.
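A minimal sketch of that intersection, using toy stand-in data (the real inputs are non-public, and a real eTLD+1 computation needs the Public Suffix List rather than the last-two-labels shortcut below):

```python
# Hypothetical inputs: (cert, host) pairs named in the suspect certs,
# the subset actually observed serving those certs, and the Alexa set.
cert_hosts = {("cert1", "a.example.com"), ("cert2", "b.example.net"),
              ("cert3", "c.example.org")}
observed_in_use = {("cert1", "a.example.com")}  # others moved to other CAs
alexa_top = {"example.com", "example.org"}


def etld1(host: str) -> str:
    # Toy approximation: last two labels. A correct implementation
    # must consult the Public Suffix List.
    return ".".join(host.split(".")[-2:])


# Only whitelist what is both observed in use and in the Alexa set.
by_host = {h for (c, h) in cert_hosts
           if (c, h) in observed_in_use and etld1(h) in alexa_top}
by_domain = {etld1(h) for h in by_host}
by_cert = {c for (c, h) in cert_hosts
           if (c, h) in observed_in_use and etld1(h) in alexa_top}
```

The three resulting sets correspond to the "by host, by domain, or by certificate" strategies, and each is no larger than the raw set of hosts named in the certs.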
Unfortunately, that's not easily publicly reproducible, which I think is an
important aspect for consideration here.
So let's again revisit the combined set of WoSign & StartCom certs (which
necessarily includes everything GoogleBot has ever seen, but not necessarily
any undisclosed and undetected StartCom certs).
We know there are 5,769 unique certificate hashes with wildcards in the Alexa
Top 1M, spanning 2,710 distinct eTLD+1s. There are 61,109 certs that contain
non-wildcard hosts, spanning 18,650 distinct eTLD+1s.
Another possibility to explore, then, is to attempt to communicate with each
of these hosts and see which certificate they provide, since we can't use the
hosts mined by Google's crawler (oh how I wish we could). If a host provides
one of these certificates, its eTLD+1 could be whitelisted, along with the
generous assumption that all wildcard hosts are using their certificates (I
believe there's sufficient evidence this isn't the case, but sure).
This may help reduce the overall 18,763 distinct eTLD+1s into an even more
compressible set, albeit at the cost of potentially excluding some
certificates that were (undetectably) in use.
dev-security-policy mailing list