Tim,

I can see value in a ballot on how to clarify incident reporting and other
contact-related issues. Right now 1.5.2 is pretty sparse with regard to how
to handle this. I would be happy to work with you on a proposal here.

Ryan

On Sun, Feb 25, 2018 at 6:41 AM, Tim Hollebeek <[email protected]>
wrote:

> Ryan,
>
> Wayne and I have been discussing making various improvements to 1.5.2
> mandatory for all CAs.  I've made a few improvements to DigiCert's CPSs in
> this area, but things probably still could be better.  There will probably
> be a CA/B ballot in this area soon.
>
> DigiCert's 1.5.2 has our support email address, and our Certificate Problem
> Report email (which I recently added).  That doesn't really cover
> everything (yet).
>
> It looks like GTS 1.5.2 splits things into security (including CPRs) and
> non-security requests.
>
> I didn't chase down any other 1.5.2's yet, but it'd be interesting to hear
> what other CAs have here.  I suspect most only have one address for
> everything.
>
> Something to keep in mind once the CA/B thread shows up.
>
> -Tim
>
> > -----Original Message-----
> > From: dev-security-policy [mailto:dev-security-policy-
> > [email protected]] On Behalf Of Ryan
> > Hurst via dev-security-policy
> > Sent: Wednesday, February 21, 2018 9:53 PM
> > To: [email protected]
> > Subject: Re: Google OCSP service down
> >
> > I wanted to follow up with our findings and a summary of this issue for
> > the community.
> >
> > Below you will find details on what happened and how we resolved the
> > issue. Hopefully this will help explain what happened and potentially
> > help others avoid a similar issue.
> >
> > Summary
> > -------
> > On January 19th, at 08:40 UTC, a code push to improve OCSP generation for
> > a subset of the Google-operated Certificate Authorities was initiated.
> > The change was related to the packaging of generated OCSP responses. The
> > first time this change was invoked in production was January 19th at
> > 16:40 UTC.
> >
> > NOTE: The publication of new revocation information to all geographies
> > can take up to 6 hours to propagate. Additionally, clients and
> > middle-boxes commonly implement caching behavior. This results in a large
> > window where clients may have begun to observe the outage.
> >
> > NOTE: Most modern web browsers “soft-fail” in response to OCSP server
> > availability issues, masking outages. Firefox, however, supports an
> > advanced option that allows users to opt in to “hard-fail” behavior for
> > revocation checking. An unknown percentage of Firefox users enable this
> > setting. We believe most users who were impacted by the outage were these
> > Firefox users.
> >
> > About 9 hours after the deployment of the change began (2018-01-20 01:36
> > UTC), a user on Twitter mentioned that they were having problems with
> > their hard-fail OCSP checking configuration in Firefox when visiting
> > Google properties. This tweet and the few that followed during the outage
> > period were not noticed by any Google employees until after the
> > incident’s post-mortem investigation had begun.
> >
> > About 1 day and 22 hours after the push was initiated (2018-01-21 15:07
> > UTC), a user posted a message to the mozilla.dev.security.policy mailing
> > list mentioning that they too were having problems with their hard-fail
> > configuration in Firefox when visiting Google properties.
> >
> > About two days after the push was initiated, a Google employee discovered
> > the post and opened a ticket (2018-01-21 16:10 UTC). This triggered the
> > remediation procedures, which began in under an hour.
> >
> > The issue was resolved about 2 days and 6 hours from the time it was
> > introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue,
> > it took 1 hour and 55 minutes to resolve the issue, and an additional 4
> > hours and 51 minutes for the fix to be completely deployed.
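
The intervals in the summary can be cross-checked with a little date
arithmetic. This is only a sketch: it assumes the "January 19th" entries
refer to 2018 (as the other timestamps suggest) and that "the time it was
introduced" means the first production run at 16:40 UTC. The helper names
are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def fmt(delta):
    """Render a timedelta as 'Nd Nh Nm'."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


# Timestamps taken from the summary (year for the first one is assumed).
first_production_run = utc("2018-01-19 16:40")
ticket_opened = utc("2018-01-21 16:10")
fully_resolved = utc("2018-01-21 22:56")

# Durations stated in the summary once Google was aware of the issue.
time_to_fix = timedelta(hours=1, minutes=55)
time_to_deploy = timedelta(hours=4, minutes=51)

print("introduced -> ticket:  ", fmt(ticket_opened - first_production_run))
print("introduced -> resolved:", fmt(fully_resolved - first_production_run))
print("ticket -> resolved:    ", fmt(fully_resolved - ticket_opened))
print("fix + deploy:          ", fmt(time_to_fix + time_to_deploy))
```

Under those assumptions the stated fix and deployment times add up exactly
to the gap between the ticket being opened and the issue being resolved.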
> >
> > No customer reports regarding this issue were sent to the notification
> > addresses listed in Google's CPSs or on the repository websites for the
> > duration of the outage. This extended the duration of the outage.
> >
> > Background
> > ----------
> > Google's OCSP infrastructure works by generating OCSP responses in
> > batches, with each batch being made up of the certificates issued by an
> > individual CA.
> >
> > In the case of GIAG2, this batch is produced in chunks of certificates
> > issued in the last 370 days. For each chunk, the GIAG2 CA is asked to
> > produce the corresponding OCSP responses, the results of which are placed
> > into a separate .tar file.
> >
> > The issuer of GIAG2 has chosen to issue new certificates to GIAG2
> > periodically; as a result, GIAG2 has multiple certificates. Two of these
> > certificates no longer have unexpired certificates associated with them.
> > As a result, and as expected, the CA does not produce responses for the
> > corresponding periods.
> >
> > All .tar files produced during this process are then concatenated with
> > the --concatenate command in GNU tar. This produces a single .tar file
> > containing all of the OCSP responses for the given Certificate Authority.
> > This .tar file is then distributed to our global CDN infrastructure for
> > serving.
> >
> > A change was made in how we batch these responses: instead of outputting
> > many .tar files within a batch, a concatenation of all tar files was
> > produced.
> >
> > The change in question triggered an unexpected behaviour in GNU tar which
> > then manifested as an empty tarball. These "empty" updates ended up being
> > distributed to our global CDN, effectively dropping some responses, while
> > continuing to serve responses for other CAs.
> >
> > During testing of the change, this behaviour was not detected, as the
> > tests did not cover the scenario in which some chunks did not contain
> > unexpired certificates.
> >
> > Findings
> > --------
> > - The outage only impacted sites with TLS certificates issued by the
> > GIAG2 CA, as it was the only CA that met the required pre-conditions of
> > the bug.
> > - The bug that introduced this failure manifested itself as an empty
> > container of OCSP responses. The root cause of the issue was an
> > unexpected behavior of GNU tar relating to concatenating tar files.
> > - The outage was observed by revocation service monitoring as “unknown
> > certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP
> > responder operations; they typically are the result of poorly configured
> > clients. These events are monitored and a threshold does exist for an
> > on-call escalation.
> > - Due to a configuration error, the designated Google team did not
> > receive an escalation message.
> > - External users did not use the contact details Google provided in the
> > CPS.
> >
> > Remediation Plan
> > ----------------
> > - A bug fix has been applied to prevent the same issue from happening
> > again.
> > - Test cases looking for a minimum number of OCSP responses in each tar
> > were added to the test automation suites to catch similar issues in the
> > future.
> > - The monitoring system that was misconfigured was updated to use the
> > correct address for escalations.
> > - Both the Google Trust Services CPS (found on pki.goog) and the Google
> > CPS (found on pki.google.com) have been updated to make it clear what
> > email address is the most expedient path to reach the PKI team for
> > non-security incidents.
> > - The Google PKI repository page was updated to show contact details in
> > the same way the Google Trust Services repository page already did, in
> > the hope of helping users find a path of escalation.
> > - The wizard that is returned for mails to the security email address has
> > been updated to also include an explicit option for issues related to the
> > “Google Certificate Authority”, in the hope of helping users who choose
> > this path of escalation.
> > - Existing procedures that are relied upon for periodic verification of
> > effective escalation have been updated to include unknown certificate
> > checking.
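
As an illustration of the "minimum number of OCSP responses in each tar"
test mentioned in the remediation plan, here is a minimal sketch using
Python's standard tarfile module. It is not Google's actual test; the
function names, the member names, and the thresholds are hypothetical, and
the archives are built in memory purely for the example.

```python
import io
import tarfile


def build_tar(entries):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def count_responses(tar_bytes):
    """Count regular-file members, each assumed to be one OCSP response."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return sum(1 for member in archive if member.isfile())


def has_minimum_responses(tar_bytes, minimum):
    """Return True if the archive holds at least `minimum` responses."""
    return count_responses(tar_bytes) >= minimum


# A populated batch passes; an archive with no members is flagged.
populated = build_tar({"resp-1.der": b"\x30\x03", "resp-2.der": b"\x30\x03"})
empty = build_tar({})

print(has_minimum_responses(populated, 2))
print(has_minimum_responses(empty, 1))
```

A check along these lines, run before a batch is pushed to the CDN, would
reject an archive that is structurally valid but contains no responses.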
> >
> > _______________________________________________
> > dev-security-policy mailing list
> > [email protected]
> > https://lists.mozilla.org/listinfo/dev-security-policy
>
_______________________________________________
dev-security-policy mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-security-policy
