The risk Matt identified is too nebulous an issue to address, tbh. How do you address a moral issue? The only way I can think of to address the moral issue is to say “we promise to be good”. But the weight that carries depends on how much you trust the actor. If you trust the actor, then the moral issue is addressed. If you don’t trust the actor, the moral issue is not addressed. If you or Matt can identify a specific threat you’d like me to address about the moral issue, I’ll do my best to respond.

* What happens is that you ask why there is risk of outage to begin with and what can be done to improve going forward? Let’s assume you do revoke, and it causes an outage - is DigiCert taking steps to ensure no customer of theirs is ever faced with that risk? If so, what are those steps?

Yeah – there are several things we can do to improve going forward:

1. Communicate better with the customers. The first mistake was waiting until we had good data to communicate with the customers. This delayed notification. This was unknown to me at the time, or we would have sent out communication prior to the ballot passing. That instruction has been passed along (no waiting on these critical issues) plus training.

2. No more skipping CAB Forum meetings for me. This was easily a foreseeable issue because we knew people couldn’t replace in January. I think it’s been brought up a half dozen times in the forum at least. I’m not sure why we didn’t communicate this in Shanghai. But, the real problem is I didn’t have direct knowledge of what was going on. I probably need to be there in person each time so we can align the company correctly with what is going on.

I don’t think we can ever take steps to ensure that no customer is ever faced with the risk of revoked certs. I’m sure there will be other items that are adopted we don’t foresee. That said, we do promote automation, short-lived certs (you can get anything from about 8 hours up through our system), and CT logging. I think the biggest surprise on this one was it applied to certs that are no longer trusted by Mozilla or Google.

> This seems to suggest that perhaps other CAs have prepared their customers
> for revocation. How does this surprise - that no other CA faces this - lead
> to tangible changes in the business processes? How would this change, if
> another CA did have the same issue? Surely you can see there are real and
> fundamental issues that you’re uniquely qualified to help your customers
> address in ways that we cannot.

I suppose they did prepare better. Maybe other CAs are just smarter than me? I won’t leave that off the table. I agree that we are uniquely positioned to help our customers remediate. Definitely anxious to do that (and are doing so).

* Have you analyzed CT, for example, to see why DigiCert is unique? Certainly, by sheer volume, it's heavily tilted towards the old Symantec infrastructure - and the customers that came over to DigiCert. With those sorts of details, how does this change how things were done, or how they will be done?

We do know most of the customers were legacy Symantec, but there are definitely some DigiCert customers in there. I think we still continue the same course. It’s only been a year from the transition, and we’ve migrated nearly everyone off the Symantec infrastructure. Next comes shutting down all the legacy Symantec systems.

* I’m not trying to pick on y’all - I think it is legitimately good that you provided concrete data. Even if you do revoke on Jan 15, this is still useful to understand the challenges, but only if this leads to meaningful changes. What might those look like?

I appreciate that. I think these are all fair questions, and I’m trying my best to answer them. I especially don’t feel picked on since we’re requesting the information/decision on what to do. I don’t know how to answer the question of what changes to make because I was a bit blindsided by the decision to revoke the certs. Probably shouldn’t have been, considering the conversation at the CAB Forum. My number one priority right now is to shut down all of the legacy Symantec systems. Last year was mostly migration of issuance and trying to get the systems up to an expected caliber of performance.
At the same time we’re introducing industry-standard (and above) automation of issuance and deployment systems that we hope will help people replace certificates faster.

* And this is the framing that I think is incredibly helpful. Understanding why customers can’t change, and what steps are being done to ensure they can, is hugely useful. Wayne’s questions were to this point - as were mine towards understanding the problem from the other side, which are steps the CA is taking. As I've repeatedly highlighted from https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation , the goal is not punishment - but understanding how these issues are being addressed.

The main blocker for all of these is policy, not technology. I don’t know how to solve third party policy decisions, which is why I can’t seem to answer the questions. The process of planning a change, getting sign-off, rolling the change to stage, getting more sign-off, and then rolling to production with final testing, combined with the blackout periods, is making something that should be easy very difficult. I run an agile team at DigiCert so none of these are concerns when we roll a change internally. It’s the revocation part that is getting people up in arms. The consistent message I’ve gotten from customers is that changing domains and certificates requires the same process. It’s just as fast to roll out a change to both items as to change just a certificate. The built-in CAB Forum 30 day cert requirement isn’t solving the issue because of the way they roll changes, not because the 30 day certs aren’t available.

* This seems like a significant improvement from “100% of customers can’t”

Definitely an improvement. I’m hoping to get to 100% by the time we hit Jan 15th. The four I posted (and one more I got more info from today) probably won’t. Even within those customers, we’re asking them to identify specifically which certificates cannot be replaced in time.

* I mean, it’s two-fold, right?
Any incident can lead to total distrust, but it’s also unlikely that a single incident leads to total distrust. The way to balance those competing statements is to do what you’re doing - and to be transparent. As Matt has highlighted, there’s a huge risk here that this leads to a moral hazard - and the best way to mitigate that is to discuss steps being taken to reduce that risk going forward, particularly about what a core part of the problem statement is - difficulty in revocation.

This isn’t our first incident sadly ☹. It probably won’t be our last. The transition from Symantec to DigiCert was... rough.

* In a number of ways, an unintentional violation is worse than an intentional violation. Ignorance is not really an excuse when you hold keys to the Internet, and being asleep at the wheel is hugely dangerous. So, if I had to pick between an intentional violation and an unintentional (and preventable) violation, I'd likely pick intentional. But there's also a huge hazard with intentional violations - those reveal potentially systemic issues and a lack of good faith, especially if they become common-place. We definitely saw CAs perform intentional violations and notify after-the-fact, and that's far, far worse than those that notify before intentionally violating (I think every post-facto notification for intentional incident has, eventually, led to that CA's distrust).

Totally agree. I really don’t want to violate the BRs, and this shouldn’t be the norm. I also recognize we don’t want to invite this question for every BR change. Maybe better Mozilla guidelines about which requests are acceptable and which are not?

* So somewhere on the scale of things, we're in a better place than most every alternative. But to ensure this is in that 'good faith' side of things, understanding what the factors are that have been evaluated, and what steps are being taken to prevent this, are significant.
As I said, I think the principles captured in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation and in the discussion about how at least some of us see this (that it's related to underscores incident response) suggests that it's not, in fact, the end of the world, or the CA, provided that meaningful data behind the decision to not revoke is given, meaningful plans and timelines for resolution are given, and meaningful steps to prevent this from ever happening again are given. It becomes an incident report, and the result is not a stern lecture - but concrete and quantifiable steps as to how to improve.

Thanks Ryan. This post was really nice. Appreciate it.

From: Ryan Sleevi <[email protected]>
Sent: Thursday, December 27, 2018 7:15 PM
To: Jeremy Rowley <[email protected]>
Cc: James Burton <[email protected]>; Ryan Sleevi <[email protected]>; mozilla-dev-security-policy <[email protected]>
Subject: Re: Underscore characters

On Thu, Dec 27, 2018 at 6:56 PM Jeremy Rowley <[email protected]> wrote:

The risk is primarily outages of major sites across the web, including certs used in Google wallet. We’re thinking that is a less than desirable result, but we weren’t sure how the Mozilla community would feel/react.

I don’t think that is a particularly helpful framing, to be honest. The risk these organizations face here is self-inflicted; regardless of the feeling of underscores, there is unquestionably an issue for organizations that cannot respond in the BR timeframes, let alone extended ones that extend for months. That's a real ecosystem issue, and regardless of the CA these customers partner with, an issue that needs both better understanding and, to be honest, better prevention.

Matt has spoken at length to the risk to the community, which doesn’t really seem like it’s been acknowledged, let alone proposed as to how it will be mitigated. I have to ask again - what steps is DigiCert taking to avoid these issues going forward?

We’re still considering revoking all of the certs on Jan 15th based on these discussions. I don’t think we’re asking for leniency (maybe we are if that’s a factor?), but I don’t know what happens if you’re faced with causing outages vs. compliance.

What happens is that you ask why there is risk of outage to begin with and what can be done to improve going forward? Let’s assume you do revoke, and it causes an outage - is DigiCert taking steps to ensure no customer of theirs is ever faced with that risk? If so, what are those steps?

I started the conversation because I feel like we should be good netizens and make people aware of what’s going on instead of just following policy. I’m actually surprised at least one other CA that has issued a large number of underscore character certs hasn’t run into the same timing issues.

This seems to suggest that perhaps other CAs have prepared their customers for revocation. How does this surprise - that no other CA faces this - lead to tangible changes in the business processes? How would this change, if another CA did have the same issue? Surely you can see there are real and fundamental issues that you’re uniquely qualified to help your customers address in ways that we cannot.

Have you analyzed CT, for example, to see why DigiCert is unique? Certainly, by sheer volume, it's heavily tilted towards the old Symantec infrastructure - and the customers that came over to DigiCert. With those sorts of details, how does this change how things were done, or how they will be done?

I’m not trying to pick on y’all - I think it is legitimately good that you provided concrete data. Even if you do revoke on Jan 15, this is still useful to understand the challenges, but only if this leads to meaningful changes. What might those look like?

Normally, we would just revoke the certs, but there are a significant number of certs in the Alexa top 100. We’ve told most customers, “No exception”.
I also thought it’s better to get the information out there so we can all make rational decisions (DigiCert included) if as many facts are known as possible.

And this is the framing that I think is incredibly helpful. Understanding why customers can’t change, and what steps are being done to ensure they can, is hugely useful. Wayne’s questions were to this point - as were mine towards understanding the problem from the other side, which are steps the CA is taking. As I've repeatedly highlighted from https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation , the goal is not punishment - but understanding how these issues are being addressed.

We are working with the partners to get the certs revoked before the deadline. Most will.

This seems like a significant improvement from “100% of customers can’t”

By January 15th, I hope there won’t be too many certs left. Unfortunately, by then it’s also too late to discuss what happens if the cert is not revoked. I.e., what are the benefits of revoking (strict compliance) vs. revoking the larger impact certs as they are migrated (incident report). Unfortunately part 2, there’s no guidance on whether an incident report means total distrust vs. something on your audit and a stern lecture.

I mean, it’s two-fold, right? Any incident can lead to total distrust, but it’s also unlikely that a single incident leads to total distrust. The way to balance those competing statements is to do what you’re doing - and to be transparent. As Matt has highlighted, there’s a huge risk here that this leads to a moral hazard - and the best way to mitigate that is to discuss steps being taken to reduce that risk going forward, particularly about what a core part of the problem statement is - difficulty in revocation.

I’d rather suffer a lecture than take down a top site. Not so willing to gamble the whole company. This is why we wanted to have the discussion now, despite no violation so far.
The response from the browsers is public - that they cannot make that determination. Does that mean we have our answer? Revoke is the only acceptable response?

I mean, the answer has been to repeatedly highlight https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation

In a number of ways, an unintentional violation is worse than an intentional violation. Ignorance is not really an excuse when you hold keys to the Internet, and being asleep at the wheel is hugely dangerous. So, if I had to pick between an intentional violation and an unintentional (and preventable) violation, I'd likely pick intentional. But there's also a huge hazard with intentional violations - those reveal potentially systemic issues and a lack of good faith, especially if they become common-place. We definitely saw CAs perform intentional violations and notify after-the-fact, and that's far, far worse than those that notify before intentionally violating (I think every post-facto notification for intentional incident has, eventually, led to that CA's distrust).

So somewhere on the scale of things, we're in a better place than most every alternative. But to ensure this is in that 'good faith' side of things, understanding what the factors are that have been evaluated, and what steps are being taken to prevent this, are significant.

As I said, I think the principles captured in https://wiki.mozilla.org/CA/Responding_To_An_Incident#Revocation and in the discussion about how at least some of us see this (that it's related to underscores incident response) suggests that it's not, in fact, the end of the world, or the CA, provided that meaningful data behind the decision to not revoke is given, meaningful plans and timelines for resolution are given, and meaningful steps to prevent this from ever happening again are given. It becomes an incident report, and the result is not a stern lecture - but concrete and quantifiable steps as to how to improve.

