Thank you for this comprehensive incident report, Ryan. Your team's decision
to improve the documentation around the right address for reporting is
great to see! I wonder if it might also make sense to put the contact
information directly on https://pki.goog above the fold?

-Paul (reaperhulk)

On February 22, 2018 at 12:53:32 PM, Ryan Hurst via dev-security-policy (
[email protected]) wrote:

I wanted to follow up with our findings and a summary of this issue for the
community.

Below you will find details on what happened and how we resolved the
issue. Hopefully this will help explain what happened and help others
avoid a similar issue.

Summary
-------
On January 19th, at 08:40 UTC, a code push to improve OCSP generation for a
subset of the Google operated Certificate Authorities was initiated. The
change was related to the packaging of generated OCSP responses. The first
time this change was invoked in production was January 19th at 16:40 UTC.

NOTE: The publication of new revocation information to all geographies can
take up to 6 hours to propagate. Additionally, clients and middle-boxes
commonly implement caching behavior. This results in a large window where
clients may have begun to observe the outage.

NOTE: Most modern web browsers “soft-fail” in response to OCSP server
availability issues, masking outages. Firefox, however, supports an
advanced option that allows users to opt-in to “hard-fail” behavior for
revocation checking. An unknown percentage of Firefox users enable this
setting. We believe most users who were impacted by the outage were these
Firefox users.

About 9 hours after the deployment of the change began (2018-01-20 01:36
UTC), a user on Twitter mentioned that they were having problems with their
hard-fail OCSP checking configuration in Firefox when visiting Google
properties. This tweet and the few that followed during the outage period
were not noticed by any Google employees until after the incident’s
post-mortem investigation had begun.

About 1 day and 22 hours after the push was initiated (2018-01-21 15:07
UTC), a user posted a message to the mozilla.dev.security.policy mailing
list mentioning that they too were having problems with their hard-fail
configuration in Firefox when visiting Google properties.

About two days after the push was initiated, a Google employee discovered
the post and opened a ticket (2018-01-21 16:10 UTC). This triggered the
remediation procedures, which began in under an hour.

The issue was resolved about 2 days and 6 hours from the time it was
introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue,
it took 1 hour and 55 minutes to resolve the issue, and an additional 4
hours and 51 minutes for the fix to be completely deployed.
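
The elapsed times quoted above can be cross-checked with a little date
arithmetic. The sketch below is not part of the original report. It assumes
the reference point for "introduced" is the first production invocation
(January 19th at 16:40 UTC, taken to be in 2018 like the other timestamps),
since that is the start time the quoted figures line up with. The helper
names are made up for illustration.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and mark it as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def elapsed(start, end):
    """Return the gap between two timestamps as (days, hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# Timestamps taken from the summary; the year of the first one is assumed.
first_invoked = utc("2018-01-19 16:40")
tweet = utc("2018-01-20 01:36")
list_post = utc("2018-01-21 15:07")
ticket_opened = utc("2018-01-21 16:10")
fully_deployed = utc("2018-01-21 22:56")

print(elapsed(first_invoked, tweet))            # (0, 8, 56)
print(elapsed(first_invoked, list_post))        # (1, 22, 27)
print(elapsed(first_invoked, fully_deployed))   # (2, 6, 16)
print(elapsed(ticket_opened, fully_deployed))   # (0, 6, 46)
```

The last figure matches the sum of the two intervals given in the summary
(1 hour 55 minutes plus 4 hours 51 minutes).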

No customer reports regarding this issue were sent to the notification
addresses listed in Google's CPSs or on the repository websites for the
duration of the outage. This extended the duration of the outage.

Background
----------
Google's OCSP Infrastructure works by generating OCSP responses in batches,
with each batch being made up of the certificates issued by an individual
CA.

In the case of GIAG2, this batch is produced in chunks of certificates
issued in the last 370 days. For each chunk, the GIAG2 CA is asked to
produce the corresponding OCSP responses, the results of which are placed
into a separate .tar file.

The issuer of GIAG2 has chosen to issue new certificates to GIAG2
periodically; as a result, GIAG2 has multiple certificates. Two of these
certificates no longer have unexpired certificates associated with them. As
a result, and as expected, the CA does not produce responses for the
corresponding periods.

All .tar files produced during this process are then concatenated with the
--concatenate option in GNU tar. This produces a single .tar file
containing all of the OCSP responses for the given Certificate Authority.
This .tar file is then distributed to our global CDN infrastructure for
serving.

A change was made in how we batch these responses: instead of outputting
many .tar files within a batch, a single concatenation of all the .tar
files was produced.

The change in question triggered an unexpected behaviour in GNU tar which
then manifested as an empty tarball. These "empty" updates ended up being
distributed to our global CDN, effectively dropping some responses, while
continuing to serve responses for other CAs.

During testing of the change, this behaviour was not detected, as the tests
did not cover the scenario in which some chunks did not contain unexpired
certificates.
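
To illustrate the kind of test that covers this scenario, here is a minimal
sketch using Python's standard tarfile module. It does not reproduce the GNU
tar behaviour described above and is not Google's pipeline: it merges chunks
by copying members one at a time, and the file names and helper functions are
hypothetical. The point is only that one chunk is deliberately empty and the
merged archive is still checked for the expected number of responses.

```python
import io
import tarfile


def make_chunk(responses):
    """Build an in-memory .tar with one file per (name, bytes) entry."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in responses.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def merge_chunks(chunks):
    """Copy every member of every chunk into a single new archive."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as merged:
        for chunk in chunks:
            with tarfile.open(fileobj=io.BytesIO(chunk), mode="r") as src:
                for member in src.getmembers():
                    merged.addfile(member, src.extractfile(member))
    return out.getvalue()


def count_members(archive):
    """Return how many entries an in-memory archive contains."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return len(tar.getnames())


# The middle chunk has no unexpired certificates, so it holds no responses.
chunks = [
    make_chunk({"serial-01.ocsp": b"response-1"}),
    make_chunk({}),
    make_chunk({"serial-02.ocsp": b"response-2"}),
]
merged = merge_chunks(chunks)
assert count_members(merged) == 2
```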

Findings
--------
- The outage only impacted sites with TLS certificates issued by the GIAG2
CA as it was the only CA that met the required pre-conditions of the bug.
- The bug that introduced this failure manifested itself as an empty
container of OCSP responses. The root cause of the issue was an unexpected
behavior of GNU tar relating to concatenating tar files.
- The outage was observed by revocation service monitoring as “unknown
certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP
responder operations; they typically are the result of poorly configured
clients. These events are monitored and a threshold does exist for an
on-call escalation.
- Due to a configuration error the designated Google team did not receive
an escalation message.
- External users did not use the contact details Google provided in the CPS.
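
As a rough illustration of the threshold-based escalation mentioned in the
findings, the sketch below flags a window of responder traffic when the share
of HTTP 404 responses grows too large. The function name, the 5% ratio, and
the minimum sample size are all assumptions chosen for the example; the
report does not state the actual thresholds or how the monitoring is built.

```python
def should_escalate(status_codes, max_404_ratio=0.05, min_samples=100):
    """Return True when the share of HTTP 404s in a window exceeds the limit.

    Both thresholds are hypothetical. A background level of 404s is
    expected, so small windows and low ratios do not escalate.
    """
    if len(status_codes) < min_samples:
        return False
    not_found = sum(1 for code in status_codes if code == 404)
    return not_found / len(status_codes) > max_404_ratio


# A normal window: a few 404s from misconfigured clients.
print(should_escalate([200] * 99 + [404]))       # False
# A window resembling an outage: half of all lookups return 404.
print(should_escalate([200] * 50 + [404] * 50))  # True
```

A check like this only helps if the escalation it triggers reaches the
on-call team, which is the configuration error described above.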

Remediation Plan
----------------
- A bug fix has been applied to prevent the same issue from happening again.
- Test cases looking for a minimum number of OCSP responses in each tar
were added to the test automation suites to catch similar issues in the
future.
- The monitoring system that was misconfigured was updated to use the
correct address for escalations.
- Both the Google Trust Services CPS (found on pki.goog) and the Google CPS
(found on pki.google.com) have been updated to make it clear what email
address is the most expedient path to reach the PKI team for non-security
incidents.
- The Google PKI repository page was updated to show contact details in the
same way the Google Trust Services repository page already did, in the hope
of helping users find a path of escalation.
- The wizard that is returned for mails to the security email address has
been updated to also include an explicit option for issues related to the
“Google Certificate Authority” in the hopes of helping users who choose
this path of escalation.
- Existing procedures that are relied upon for periodic verification of
effective escalation have been updated to include unknown certificate
checking.
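
The minimum-response test case from the remediation plan might look roughly
like the following. This is a sketch under assumptions, not the test that was
actually added: the function name and the idea of raising an exception are
illustrative, and the appropriate minimum for each archive is not given in
the report.

```python
import io
import tarfile


def check_min_responses(archive_bytes, minimum):
    """Raise ValueError if the archive holds fewer files than `minimum`.

    Returns the number of regular files found so callers can log it.
    """
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        count = sum(1 for member in tar.getmembers() if member.isfile())
    if count < minimum:
        raise ValueError(
            f"archive has {count} OCSP responses, expected at least {minimum}"
        )
    return count
```

Running a check of this kind before an archive is pushed to the CDN would
stop an empty tarball from replacing a populated one.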

_______________________________________________
dev-security-policy mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-security-policy