Hi Job,

> On 13 Jul 2021, at 12:57, Job Snijders via routing-wg <routing-wg@ripe.net> 
> wrote:
> 
> Hi,
> 
> On Mon, Jul 12, 2021 at 10:23:20AM +0200, Daniel Karrenberg wrote:
>> Nathalie pointed us to
>> https://www.ripe.net/manage-ips-and-asns/resource-management/rpki/rpki-planning-and-roadmap
>> a while ago. Among other things this says:
>> 
>> “In preparation for the improved RPKI repository architecture, the
>> distributed nature of the RRDP repository is going to be implemented using
>> containers and krill-sync that pulls data from the centralised on-premise
>> repository. This greatly simplifies smooth transitioning between publication
>> servers without any downtime.
>> 
>> NOTE: We are not referring to cloud technologies here, just to our internal
>> deployment technologies.”
>> 
>> The silence here worries me.
> 
> What silence?!
> 
> Over the last few months there have been quite some mail threads in this
> working group about RPKI and RPKI outage incidents, and NCC staff have
> provided updates during the virtual RIPE meetings in the Routing WG
> slot.
> 
> To me the roadmap seems to reflect the sentiment that reliability is the
> key objective at this moment in time.
> 
>> I would like to see some feedback from this group whether this is what
>> you want to see happening. The RIPE Routing WG is the forum for giving
>> guidance to the RIPE NCC about RPKI. I know other channels exist too
>> and that is fine. I also know that individuals here seem to be happy
>> with what is happening. However private channels and conversations are
>> not the way RIPE does this.  This group is where the RIPE NCC looks
>> for guidance and where that guidance gets properly archived and
>> responded to.
> 
> To be honest I am not sure what the purpose of krill-sync is.
> 
> In May 2021 [1] extensive testing was conducted with the help of the
> NLNOG RING to see if krill-sync could be used to power the RSYNC
> service, but it turned out there were multiple issues with krill-sync
> making it a suboptimal choice. I believe RIPE NCC ended up deploying a
> different solution to serve RSYNC - and my hope is that the
> recently-achieved stability is here to stay, because the current setup
> seems to work quite nicely.

We are evaluating krill-sync [1] as a tool to build rsync servers that are
independent of NFS and can use cached I/O.

The reason for this is rsync fallback. We see ~139 RPs using the rsync
repository (as well as the majority of the NLNOG RING nodes) and >1600 RPs
using the RRDP repository [2]. When rsync fallback happens for many RPs at
once, the current infrastructure will likely not scale, even if each RP starts
from the last RRDP state it already holds.
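From the RP side, that fallback looks roughly like the sketch below. This is
only an illustration: real validators differ in their retry and timeout
behaviour, and the URLs and paths are placeholders, not real endpoints. The
point is that RPs prefer RRDP and only hit rsync when RRDP is unavailable,
which is exactly when many of them arrive at the same time.

    # Simplified sketch of relying-party fallback behaviour; real validators
    # differ, and the URLs/paths passed in are placeholders.
    import subprocess
    import urllib.error
    import urllib.request

    def refresh_repository(notification_url: str, rsync_url: str,
                           local_dir: str) -> str:
        try:
            # Normal path: fetch the RRDP notification and apply the
            # referenced deltas or snapshot.
            with urllib.request.urlopen(notification_url, timeout=30) as resp:
                resp.read()
            # ... apply RRDP deltas or snapshot here ...
            return "rrdp"
        except (urllib.error.URLError, OSError):
            # RRDP is unavailable: fall back to rsync. When many RPs do this
            # simultaneously, the rsync service sees a large load spike, even
            # though each RP only transfers the delta against the files it
            # already holds from the last RRDP state.
            subprocess.run(["rsync", "-rt", "--delete", rsync_url, local_dir],
                           check=True)
            return "rsync"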

We are evaluating krill-sync because it allows us to build an rsync repository
from RRDP data and is available as an open-source project.
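Conceptually the idea is something like the sketch below. This is my own
illustration of the RRDP-to-rsync direction, not how krill-sync is actually
implemented, and the notification URL and target directory are placeholders:
fetch notification.xml, download the referenced snapshot, and write every
published object into a directory tree that an rsync daemon can serve.

    # Illustrative sketch only: build an rsync-servable tree from an RRDP
    # snapshot. The notification URL and target directory are placeholders.
    import base64
    import pathlib
    import urllib.request
    import xml.etree.ElementTree as ET

    RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"
    NOTIFICATION_URL = "https://rrdp.example.net/notification.xml"  # placeholder
    TARGET_DIR = pathlib.Path("/srv/rsync/repository")              # placeholder

    def fetch_xml(url):
        with urllib.request.urlopen(url) as resp:
            return ET.fromstring(resp.read())

    notification = fetch_xml(NOTIFICATION_URL)
    snapshot_url = notification.find(RRDP_NS + "snapshot").get("uri")
    snapshot = fetch_xml(snapshot_url)

    for publish in snapshot.findall(RRDP_NS + "publish"):
        # Map rsync://host/module/path (minus the host) to a path below
        # TARGET_DIR and write the base64-decoded object there.
        rsync_uri = publish.get("uri")
        rel_path = rsync_uri.split("://", 1)[1].split("/", 1)[1]
        out = TARGET_DIR / rel_path
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(base64.b64decode(publish.text))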

I recall that while evaluating that krill-sync-based environment we found three
issues (the first two are illustrated in the sketch after this list):
  * Repository versions need to remain available for two hours _after they were
    last the current version_ to give slow clients a chance to retrieve them [3].
  * The modification time of objects needs to be identical (between nodes and
    between copies of a serial) to prevent additional I/O for RPs.
  * There are some very slow outliers reading repositories, but keeping versions
    available for two hours is long enough in practice.
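A minimal sketch of the first two rules; the directory layout and the
hard-coded two-hour constant are assumptions for illustration, not our actual
implementation:

    # Illustrative sketch of the retention and mtime rules described above;
    # the layout and constants are assumptions, not the actual setup.
    import os
    import shutil
    import time
    import pathlib

    RETENTION_SECONDS = 2 * 3600   # keep a version 2h after it stops being current

    def write_object(path: pathlib.Path, data: bytes, mtime: float):
        """Write an object and pin its mtime, so every node and every copy of
        a serial exposes the same timestamp and RPs do not re-fetch unchanged
        files."""
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        os.utime(path, (mtime, mtime))

    def cleanup(superseded_at: dict):
        """Delete a version directory only once it has been non-current for
        two hours, so slow clients that started reading it can still finish.
        superseded_at maps version directory -> time it stopped being current."""
        now = time.time()
        for version_dir, when in list(superseded_at.items()):
            if now - when > RETENTION_SECONDS:
                shutil.rmtree(version_dir, ignore_errors=True)
                del superseded_at[version_dir]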

Finding these issues was good: it ensured that they were accounted for in our
implementation that writes to NFS. After we reported the relevant issues
upstream, they were fixed in krill-sync. The use of the NLNOG RING also helped
verify the current NFS-based setup, which I agree is working nicely.

Kind regards,
Ties

[1]: https://www.ripe.net/ripe/mail/archives/routing-wg/2021-June/004351.html
[2]: rsync: number of unique IPs reading from /repository yesterday in one
hour; hour-to-hour variance is minimal. RRDP: number of unique IPs retrieving
notification.xml more than 24 times/day in early July.
[3]: Example: revision 0 is published at 0h00m, revision 1 at 1h59m, and
revision 2 at 2h01m (at which point revision 0 is deleted). A client that
connected at 1h58m and is still reading revision 0 then has its files deleted
mid-retrieval.

