Dear colleagues,

First, let me give you an overview of our rsync infrastructure and the situation
encountered by a client. Afterwards, I will describe the context of our
application and repository, and how that limits our design space.

RPKI objects are created on machines in an isolated network. The active machine
writes new objects to an NFS share (with replicated storage in two data 
centres).
The rsync machines (outside the isolated network) serve these files. These are
behind a load balancer.

Sets of objects to be updated (for example a manifest, CRL, and certificate) are
written to a staging directory by the application. After all the files are
created, they are moved into the main repository directory. There is a small
period between these moves. In the recently reported incident, this window was
~30ms, and the files were moved in an order that kept the repository correct on
disk at all times. This part of the code has been in place since 2012.
While the files are written to the filesystem, they are also sent to a
publication server (implementing a draft version of RFC 8181). The files are
sent atomically, in one message. The publication is synchronous: when a ROA is
created, it is immediately published to both rsync and the publication server.
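
For illustration, here is a minimal sketch of this write-to-staging-then-move
pattern (the paths and the function are hypothetical, not the application's
actual code):

    import os

    REPO = "/repository/online"       # hypothetical: directory served over rsync
    STAGING = "/repository/staging"   # hypothetical: staging area on the same share

    def publish(objects):
        """Write a set of objects to staging, then move them into the repository.

        `objects` is a list of (repository-relative path, content bytes) pairs,
        in the order in which the moves should happen.
        """
        staged = []
        for rel_path, content in objects:
            staging_path = os.path.join(STAGING, rel_path)
            os.makedirs(os.path.dirname(staging_path), exist_ok=True)
            with open(staging_path, "wb") as f:
                f.write(content)
            staged.append((rel_path, staging_path))

        # Only after *all* files exist in staging are they moved into place,
        # ordered so the repository stays correct on disk at every step
        # (e.g. a new certificate before the manifest that lists it).
        for rel_path, staging_path in staged:
            target = os.path.join(REPO, rel_path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            os.replace(staging_path, target)  # atomic per file, not per set

Each individual move is atomic, but a client can still observe the short window
between two moves within the same set.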

In the reported incident, the affected client read the file list *before* the
new certificate was present, but then read the content of (and copied) the
updated manifest, which referred to a certificate that was not present when the
file list was created. In the rest of this document, we will call this
situation a non-repeatable read; part of the retrieved repository reflects one
state while another part reflects a different state.
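
To make that interleaving concrete, here is a small, self-contained sketch. The
file names and contents are hypothetical, and it only mimics rsync's
list-then-transfer behaviour rather than rsync itself:

    import os, shutil, tempfile

    # Hypothetical file names; the real repository layout is more involved.
    repo = tempfile.mkdtemp(prefix="repo-")
    client_copy = tempfile.mkdtemp(prefix="client-copy-")

    def write(name, content):
        with open(os.path.join(repo, name), "w") as f:
            f.write(content)

    # Repository state 1: old manifest, new certificate not yet published.
    write("manifest.mft", "references: []")

    # Client step 1: build the file list (as rsync does before transferring).
    file_list = os.listdir(repo)            # ['manifest.mft']

    # Publisher: the new certificate and updated manifest are moved into place.
    write("new-cert.cer", "certificate")
    write("manifest.mft", "references: [new-cert.cer]")

    # Client step 2: copy the files from the earlier list, reading current content.
    for name in file_list:
        shutil.copy(os.path.join(repo, name), client_copy)

    # The copied manifest references new-cert.cer, which was never transferred:
    # the copy mixes two repository states (a non-repeatable read).
    print(open(os.path.join(client_copy, "manifest.mft")).read())
    print(os.listdir(client_copy))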

On April 12, we published 41,600 sets of objects. This resulted in 41,600
distinct repository states on disk. The RIPE NCC repository contains ~65,500
files in ~34,500 directories, with a total size of 157MB.

The repository is consistent (on disk) when the application is not publishing
objects. The repository is also consistent for a slow client when no files are
added or changed after that client starts retrieving the file list.

Copying the repository without coordination from the application (i.e. to spool
it) has the same risk of a non-repeatable read as rsync clients have. However,
in this case an inconsistent copy would affect many clients for an extended
period, masking the underlying issue instead of solving it. Other approaches
(such as snapshotting) also have limitations that make them untenable.
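
For reference, this is roughly what such uncoordinated spooling looks like. The
paths are hypothetical, and this is not the script that was posted to the list:

    import os, shutil, time

    LIVE = "/repository/online"    # hypothetical: directory the application writes to
    SPOOL = "/repository/spool"    # hypothetical: completed copies live here
    SERVED = "/repository/served"  # hypothetical: symlink that rsyncd would serve

    def spool_once():
        """Copy the live repository, then atomically repoint the served symlink."""
        snapshot = os.path.join(SPOOL, time.strftime("%Y%m%dT%H%M%S"))
        # This copy runs without coordination with the application, so it can
        # capture the same mixed state that a slow rsync client would see ...
        shutil.copytree(LIVE, snapshot)
        # ... and every client is then served that inconsistent copy until the
        # next run, instead of only the clients that happened to hit the window.
        tmp = SERVED + ".new"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(snapshot, tmp)
        os.replace(tmp, SERVED)    # atomic symlink swap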

The RPKI application does not support writing the complete repository to disk
for each state (as the proposed spooling scripts would require). Synchronously
writing every state of the repository to disk is not feasible, given our update
frequency and repository size. Functionality for asynchronously writing the
repository to disk still needs to be developed. We have two paths to develop
this:
- The first is a new daemon that writes the repository to disk from the
  database state at a set interval.
- The second is to use RRDP as the source of truth and write the repository to
  disk from it (a rough sketch of this approach follows below).
Furthermore, we would need to migrate the storage away from NFS to get faster
writes.
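
To illustrate the second path, here is a rough sketch of an asynchronous writer
that uses RRDP (RFC 8182) as the source of truth. The URL, paths, and interval
are hypothetical, and this is not the planned implementation:

    import base64, os, time, urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical values: the real snapshot URI is taken from the RRDP
    # notification file, and the interval is not a committed number.
    SNAPSHOT_URL = "https://rrdp.example.net/snapshot.xml"
    SPOOL = "/repository/rrdp-spool"             # complete snapshots are written here
    SERVED = "/repository/served"                # symlink that rsyncd serves
    RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"  # RRDP XML namespace (RFC 8182)

    def write_snapshot():
        """Materialise one RRDP snapshot as a complete repository on disk."""
        root = ET.fromstring(urllib.request.urlopen(SNAPSHOT_URL).read())
        out_dir = os.path.join(SPOOL, root.attrib["session_id"] + "-" + root.attrib["serial"])
        for publish in root.iter(RRDP_NS + "publish"):
            # uri looks like rsync://host/module/path/to/object; keep the path part.
            rel_path = publish.attrib["uri"].split("://", 1)[1].split("/", 1)[1]
            path = os.path.join(out_dir, rel_path)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(base64.b64decode(publish.text))
        return out_dir

    def switch_to(out_dir):
        """Atomically point the served directory at the newly written snapshot."""
        tmp = SERVED + ".new"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(out_dir, tmp)
        os.replace(tmp, SERVED)   # new rsync clients see exactly one snapshot

    while True:
        switch_to(write_snapshot())
        time.sleep(600)           # hypothetical interval

Because each RRDP snapshot describes a single repository state, every swap
exposes exactly one consistent state to rsync clients that connect afterwards.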

Both approaches need an extended validation period, and we are not able to
deploy either within a few weeks. The latter approach (using RRDP) carries less
risk
and is the option we are aiming for at the moment. We plan to release the new
publication infrastructure in Q2/Q3 2021 and hope to migrate earlier.

I’m happy to answer any further questions you may have.

Kind regards,

Ties de Kock
RIPE NCC


> On 12 Apr 2021, at 15:12, Nick Hilliard <[email protected]> wrote:
> 
> Erik Bais wrote on 12/04/2021 11:41:
>> This looks to be a 3 line bash script fix on a cronjob …  So why isn’t this 
>> just tested on a testbed and updated before the end of the week ?
> 
> cache coherency and transaction guarantees are problems which are known to be 
> difficult to solve.  Long term, the RIPE NCC probably needs to aim for the 
> degree of transaction read consistency that might be provided by an ACID 
> database or something equivalent, and that will take time and probably a 
> migration away from a filesystem-based data store.
> 
> So the question is what to do in the interim?  The bash script that Job 
> posted may help to reduce some of the race conditions that are being 
> observed, but it's unlikely to guarantee transaction consistency in a deep 
> sense.  Does the RIPE NCC have an opinion on whether the approach used in 
> this script would help with some of the problems that are being seen and if 
> so, would there be any significant negative effects associated with 
> implementing it as an intermediate patch-up?
> 
> Nick
> 

