Tim,

I can see value in a ballot on how to clarify incident reporting and other contact-related issues. Right now 1.5.2 is pretty sparse with regard to how to handle this. I would be happy to work with you on a proposal here.

Ryan

On Sun, Feb 25, 2018 at 6:41 AM, Tim Hollebeek <[email protected]> wrote:

> Ryan,
>
> Wayne and I have been discussing making various improvements to 1.5.2 mandatory for all CAs. I've made a few improvements to DigiCert's CPSs in this area, but things probably still could be better. There will probably be a CA/B ballot in this area soon.
>
> DigiCert's 1.5.2 has our support email address, and our Certificate Problem Report email (which I recently added). That doesn't really cover everything (yet).
>
> It looks like GTS 1.5.2 splits things into security (including CPRs) and non-security requests.
>
> I didn't chase down any other 1.5.2's yet, but it'd be interesting to hear what other CAs have here. I suspect most only have one address for everything.
>
> Something to keep in mind once the CA/B thread shows up.
>
> -Tim
>
> > -----Original Message-----
> > From: dev-security-policy [mailto:dev-security-policy-[email protected]] On Behalf Of Ryan Hurst via dev-security-policy
> > Sent: Wednesday, February 21, 2018 9:53 PM
> > To: [email protected]
> > Subject: Re: Google OCSP service down
> >
> > I wanted to follow up with our findings and a summary of this issue for the community.
> >
> > Below you will see details on what happened and how we resolved the issue; hopefully this will help explain what happened and potentially help others avoid a similar issue.
> >
> > Summary
> > -------
> > January 19th, at 08:40 UTC, a code push to improve OCSP generation for a subset of the Google operated Certificate Authorities was initiated. The change was related to the packaging of generated OCSP responses. The first time this change was invoked in production was January 19th at 16:40 UTC.
> >
> > NOTE: The publication of new revocation information to all geographies can take up to 6 hours to propagate. Additionally, clients and middle-boxes commonly implement caching behavior.
> > This results in a large window where clients may have begun to observe the outage.
> >
> > NOTE: Most modern web browsers “soft-fail” in response to OCSP server availability issues, masking outages. Firefox, however, supports an advanced option that allows users to opt in to “hard-fail” behavior for revocation checking. An unknown percentage of Firefox users enable this setting. We believe most users who were impacted by the outage were these Firefox users.
> >
> > About 9 hours after the deployment of the change began (2018-01-20 01:36 UTC) a user on Twitter mentioned that they were having problems with their hard-fail OCSP checking configuration in Firefox when visiting Google properties. This tweet and the few that followed during the outage period were not noticed by any Google employees until after the incident’s post-mortem investigation had begun.
> >
> > About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC), a user posted a message to the mozilla.dev.security.policy mailing list in which they mentioned that they too were having problems with their hard-fail configuration in Firefox when visiting Google properties.
> >
> > About two days after the push was initiated, a Google employee discovered the post and opened a ticket (2018-01-21 16:10 UTC). This triggered the remediation procedures, which began in under an hour.
> >
> > The issue was resolved about 2 days and 6 hours from the time it was introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue, it took 1 hour and 55 minutes to resolve the issue, and an additional 4 hours and 51 minutes for the fix to be completely deployed.
> >
> > No customer reports regarding this issue were sent to the notification addresses listed in Google's CPSs or on the repository websites for the duration of the outage. This extended the duration of the outage.
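
The intervals quoted in the summary above can be cross-checked against its timestamps. The following is only a sketch: the helper name `utc` is hypothetical, and it assumes the two "January 19th" entries refer to 2018, which the summary does not state explicitly.

```python
from datetime import datetime, timedelta, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and mark it as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps taken from the summary (year of the first one is an assumption).
first_invoked = utc("2018-01-19 16:40")   # change first invoked in production
ticket_opened = utc("2018-01-21 16:10")   # Google employee opened a ticket
fully_deployed = utc("2018-01-21 22:56")  # issue reported as resolved

# Intervals stated in the summary.
time_to_fix = timedelta(hours=1, minutes=55)
time_to_deploy = timedelta(hours=4, minutes=51)

# The ticket time plus both stated intervals should land on the resolution time.
assert ticket_opened + time_to_fix + time_to_deploy == fully_deployed

# Total time from first production invocation to full deployment of the fix.
outage = fully_deployed - first_invoked
print(outage)  # 2 days, 6:16:00
```

Measured from the first production invocation at 16:40 UTC, this matches the "about 2 days and 6 hours" figure; measured from the 08:40 UTC push, the same calculation would give eight hours more.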
> >
> > Background
> > ----------
> > Google's OCSP Infrastructure works by generating OCSP responses in batches, with each batch being made up of the certificates issued by an individual CA.
> >
> > In the case of GIAG2, this batch is produced in chunks of certificates issued in the last 370 days. For each chunk, the GIAG2 CA is asked to produce the corresponding OCSP responses, the results of which are placed into a separate .tar file.
> >
> > The issuer of GIAG2 has chosen to issue new certificates to GIAG2 periodically; as a result, GIAG2 has multiple certificates. Two of these certificates no longer have unexpired certificates associated with them. As a result, and as expected, the CA does not produce responses for the corresponding periods.
> >
> > All .tar files produced during this process are then concatenated with the --concatenate command in GNU tar. This produces a single .tar file containing all of the OCSP responses for the given Certificate Authority; this .tar file is then distributed to our global CDN infrastructure for serving.
> >
> > A change was made in how we batch these responses: specifically, instead of outputting many .tar files within a batch, a concatenation of all tar files was produced.
> >
> > The change in question triggered an unexpected behaviour in GNU tar which then manifested as an empty tarball. These "empty" updates ended up being distributed to our global CDN, effectively dropping some responses, while continuing to serve responses for other CAs.
> >
> > During testing of the change, this behaviour was not detected, as the tests did not cover the scenario in which some chunks did not contain unexpired certificates.
> >
> > Findings
> > --------
> > - The outage only impacted sites with TLS certificates issued by the GIAG2 CA as it was the only CA that met the required pre-conditions of the bug.
> > - The bug that introduced this failure manifested itself as an empty container of OCSP responses. The root cause of the issue was an unexpected behavior of GNU tar relating to concatenating tar files.
> > - The outage was observed by revocation service monitoring as “unknown certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP responder operations; they typically are the result of poorly configured clients. These events are monitored and a threshold does exist for an on-call escalation.
> > - Due to a configuration error the designated Google team did not receive an escalation message.
> > - External users did not use the contact details Google provided in the CPS.
> >
> > Remediation Plan
> > ----------------
> > - A bug fix has been applied to prevent the same issue from happening again.
> > - Test cases looking for a minimum number of OCSP responses in each tar were added to the test automation suites to catch similar issues in the future.
> > - The monitoring system that was misconfigured was updated to use the correct address for escalations.
> > - Both the Google Trust Services CPS (found on pki.goog) and the Google CPS (found on pki.google.com) have been updated to make it clear what email address is the most expedient path to reach the PKI team for non-security incidents.
> > - The Google PKI repository page was updated to show contact details in the same way the Google Trust Services repository page already did, in the hope of helping users find a path of escalation.
> > - The wizard that is returned for mails to the security email address has been updated to also include an explicit option for issues related to the “Google Certificate Authority” in the hopes of helping users who choose this path of escalation.
> > - Existing procedures that are relied upon for periodic verification of effective escalation have been updated to include unknown certificate checking.
> >
> > _______________________________________________
> > dev-security-policy mailing list
> > [email protected]
> > https://lists.mozilla.org/listinfo/dev-security-policy
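
The remediation item about test cases that look for a minimum number of OCSP responses in each tar could be sketched roughly as follows. This is an illustration only, not Google's actual test: the function names, the member file names, and the thresholds are all hypothetical, and it uses Python's standard `tarfile` module rather than GNU tar.

```python
import io
import tarfile


def count_responses(tar_bytes: bytes) -> int:
    """Count regular-file members in an in-memory, uncompressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return sum(1 for member in archive if member.isfile())


def check_minimum_responses(tar_bytes: bytes, minimum: int) -> None:
    """Raise ValueError if the archive holds fewer members than expected."""
    found = count_responses(tar_bytes)
    if found < minimum:
        raise ValueError(
            f"only {found} responses in archive, expected at least {minimum}"
        )


def build_tar(files: dict) -> bytes:
    """Build a tar archive in memory from a {name: payload} mapping (test helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical fixtures: placeholder bytes stand in for real OCSP responses.
good = build_tar({"serial-01.der": b"placeholder", "serial-02.der": b"placeholder"})
empty = build_tar({})

check_minimum_responses(good, minimum=2)  # passes without raising

try:
    check_minimum_responses(empty, minimum=1)
except ValueError as error:
    print(error)
```

A check of this kind, run before distribution, would flag an archive that is well-formed but contains no responses, which is the failure mode described in the findings.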

