Re: use TCP_NODELAY on TCP sockets?
On Thu, May 16, 2024 at 12:55:38PM +0200, Ondrej Zajicek wrote: > Yeah, i think that using TCP_NODELAY for BGP makes sense, considering > there is already non-trivial framing and we write individual BGP > messages with one write(). I suspect it might be better for throughput to send multiple BGP messages with a single writev() or sendmsg() call. Kind regards, Job
Re: use TCP_NODELAY on TCP sockets?
On Wed, May 15, 2024 at 09:09:47PM +0200, Erin Shepherd wrote: > It seems absent from the BSDs, but on Linux you can pass the MSG_MORE > flag to send() to override TCP_NODELAY for a specific write Am I understanding correctly this is a variant on TCP_NOPUSH/TCP_CORK? "more data is coming, dont push the send button yet!" In OpenBGPD, TCP_NODELAY is set on the socket (a socket option available on all platforms, I think?), and then all data is coalesced into sendmsg(), no need for 'corking'. From my limited testing it seems a full routing table should fit in ~ TCP 41,000 packets. BIRD has a code path sk_sendmsg()->sendmsg() called from sk_maybe_write(); but based my limited testing I'm not sure this path is followed in all cases, because I see way more than 41K packets for a full table feed (with TCP_NODELAY enabled). Perhaps there are two separate questions here: - are BGP messages (slightly) delayed because of TCP_NODELAY not being set? (I think yes) - are BGP messages as efficiently coalesced into as few TCP packets as possible? (with TCP_NODELAY set, I am not sure) Kind regards, Job ps. To clarify why I started this thread: last week I fell into the TCP subsystem rabbit hole: why are things the way they are? I started auditing various programs related to my $dayjob and thought it would be good to open a conversation with the BIRD developer community. My goal is not necessarily to get this patch 'as-is' merged, but to learn from and with friendly and respected BGP developers.
Re: use TCP_NODELAY on TCP sockets?
Dear Marco, On Wed, 15 May 2024 at 19:27, Marco d'Itri wrote: > On May 15, Job Snijders via Bird-users wrote: > > > But please be very careful in considering this patch, because it does > > introduce some subtle changes in the on-the-wire behaviour of BIRD. For > > example, without this patch an UPDATE for a handful of routes and the > > End-of-RIB marker might end up in the same TCP packet (if this fits); > > but with this patch, the End-of-RIB marker ends up in its own TCP > > packet. As things are today, setting TCP_NODELAY will increase the > I think that this can be fixed easily with TCP_CORK. > Basically, with TCP_NODELAY + TCP_CORK you can have the optimal balance > of latency and overhead without using writev. Yes, thanks for sharing this suggestion. Note that TCP_CORK is a Linux-only feature (on FreeBSD it seems aliased to TCP_NOPUSH, but I might be misunderstanding that code). There are subtle differences between NOPUSH and CORK: resetting CORK triggers a flush, whereas resetting NOPUSH does not. It is of course reasonable to optimize one platform and not others, we work with the tools that we have - but a portable approach (using writev()?) would seem attractive to me :-) Kind regards, Job >
use TCP_NODELAY on TCP sockets?
Dear BIRD people, On most systems RFC 896 TCP congestion control is used, also known as "Nagle's algorithm". This algorithm is intended to help coalesce consecutive small packets from userland applications (like BIRD) into a single larger TCP packet. The idea being it reduces bandwidth because there is less TCP overhead if data is bundled into fewer packets. This happens at the cost of an increase in end-to-end latency: the sender will locally queue up data (in the operating system's TCP buffers) until it either receives a TCP Ack from the remote side for all previously sent data, or sufficient additional data piled up to send out a full-sized TCP segment. In my simple testing setup (announce 255 routes to a peer) with this patch it takes ~ 0.007478 seconds after SYN to send the UPDATE message out. Without this patch, it takes ~ 0.206955 seconds to send the UPDATE out (perhaps because of negative interaction between Nagle's algorithm and Delayed Acks?). I think using TCP_NODELAY is interesting to consider, because it seems sensible to try to deliver BGP messages as fast as possible. OpenBGPD and FRR set the TCP_NODELAY socket option. But please be very careful in considering this patch, because it does introduce some subtle changes in the on-the-wire behaviour of BIRD. For example, without this patch an UPDATE for a handful of routes and the End-of-RIB marker might end up in the same TCP packet (if this fits); but with this patch, the End-of-RIB marker ends up in its own TCP packet. As things are today, setting TCP_NODELAY will increase the average number of packets send to peers. Packing multiple BGP messages into fewer TCP frames could perhaps be reintroduced by using writev() instead of write(), when it is known a bunch of messages have been queued up for a peer. Another complex behaviour is that the effect of the Nagle algorithm depends on latency between local system and remote system. Nagle's algorithm makes it so that data is locally buffered as long as the remote side has not yet acked all previously sent data; this means that proportional to the distance (latency) chances of message coalescence increase. Without this patch BIRD will send more individual TCP packets to peers that are close by compared to peers that are far away. But with TCP_NODELAY applied, all peers will receive the same number of TCP packets, regardless of latency. Look forward to hear your thoughts! Kind regards, Job sysdep/unix/io.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c index 9b499020..f9746cb7 100644 --- a/sysdep/unix/io.c +++ b/sysdep/unix/io.c @@ -946,7 +946,7 @@ sock_new(pool *p) static int sk_setup(sock *s) { - int y = 1; + int y = 1, nodelay = 1; int fd = s->fd; if (s->type == SK_SSH_ACTIVE) @@ -1048,6 +1048,10 @@ sk_setup(sock *s) return -1; } + if ((s->type == SK_TCP_PASSIVE) || (s->type == SK_TCP_ACTIVE)) +if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, , sizeof(nodelay)) < 0) + ERR("TCP_NODELAY"); + /* Must be after sk_set_tos4() as setting ToS on Linux also mangles priority */ if (s->priority >= 0) if (sk_set_priority(s, s->priority) < 0)
Re: Large communities indicating RPKI VALID status
On Mon, 29 Apr 2024 at 21:27, Nigel Kukard via Bird-users < bird-users@network.cz> wrote: > Hi there Richard, > > On 4/29/24 19:14, Richard Laager wrote: > > Perhaps I am naive, but I assumed one would validate RPKI on the eBGP edge > and simply reject INVALID routes. > > Why would one want to accept INVALID at all? > > If we agree one would reject INVALID, then what is left to tag? > > For my specific use case I wanted to add a community for VALID and > UNKNOWN. I'm going to look into the non-transitive extended communities to > see how this works out. > Sure, but why add such communities? It reduces performance and doesn’t add security benefits. OTOH - it can satisfy curiosity about where traffic is flowing - then again, using a traffic analyser like pmacct or Kentik helps offer insight how much traffic is going to Valid vs Not-Found destinations, without the need to add any communities. I’m not saying you shouldn’t pursue adding a few non-transitive extended communities here and there for your use case; just that generally speaking, operators probably should not apply different policies for Valid and Not-Found states. Kind regards, Job >
Re: Large communities indicating RPKI VALID status
On Sat, Apr 27, 2024 at 03:00:45PM +0200, Ondrej Zajicek via Bird-users wrote: > On Sat, Apr 27, 2024 at 08:18:18AM +0200, Daniel Suchy via Bird-users wrote: > > There's internet draft describing in detail, why it's not a good idea to > > store RPKI validation state inside community variables at all.. > > > > https://www.ietf.org/archive/id/draft-ietf-sidrops-avoid-rpki-state-in-bgp-00.html > > Well, note that this draft is primarily about not announcing validation > state in transitive attributes to the whole internet. Yes > But there are good reasons for having validation state in > non-transitive BGP attributes (or communities properly filtered out on > AS egress). Validating RPKI and marking routes at AS ingress ensures > that all routers within AS have consistent routing state and thus > avoiding routing loops. Perhaps I am missing something, but how does marking in itself help avoid routing loops? I am under the impression that a requirement for intra-AS transitive marking follows from non-standard modifying of non-transitive local attributes (for example, 'weight' or 'preference') based on the validation state - which is not what I'd recommend to begin with. Merely rejecting RPKI-invalid routes at the AS ingress and *not* manipulating any other local or transitive path attributes will also ensure a consistent routing state. > Unfortunately large communities do not have transitive flag like > extended ones > so perhaps it would make sense to use extended community for this > purpose. Or perhaps there should be well-known extended community for > that ... https://www.rfc-editor.org/rfc/rfc8097.html ? Kind regards, Job
Re: [patch] add 'source address' configuration option to RPKI protocols
On Thu, Feb 22, 2024 at 03:17:52PM +0100, Ondrej Zajicek wrote: > On Wed, Feb 21, 2024 at 07:14:18PM +0100, Job Snijders via Bird-users wrote: > > I'd like to be able to explicitly configure the source IP address > > for RPKI-To-Router sessions. Predictable source addresses are useful > > for minimizing the holes to be poked in ACLs. The below changeset > > adds a 'source address' configuration option to RPKI protocols. > > Thanks, merged: > > https://gitlab.nic.cz/labs/bird/-/commit/e2728c8078161d9811d6c24a11e4c95efd1c9313 > > I just changed the name from 'source address' to 'local address'. Thank you! And kudos to the project, it was nice to observe how the pre-existing infrastructure (like bird sockets) made this is a trivial excercise. :) Kind regards, Job
[patch] add 'source address' configuration option to RPKI protocols
Dear BIRD team, Greetings from Amsterdam! I'd like to be able to explicitly configure the source IP address for RPKI-To-Router sessions. Predictable source addresses are useful for minimizing the holes to be poked in ACLs. The below changeset adds a 'source address' configuration option to RPKI protocols. Kind regards, Job diff --git doc/bird.sgml doc/bird.sgml index 76ca7f75..a271d47e 100644 --- doc/bird.sgml +++ doc/bird.sgml @@ -5700,6 +5700,7 @@ protocol rpki [name] { refresh [keep] num; retry [keep] num; expire [keep] num; +source address ip; transport tcp; transport ssh { bird private key "/path/to/id_rsa"; @@ -5753,6 +5754,9 @@ specify both channels. instead. This may be useful for implementing loose RPKI check for blackholes. Default: disabled. +source address +Define local address we should use as a source address for the RTR session. + transport tcp Unprotected transport over TCP. It's a default transport. Should be used only on secure private networks. Default: tcp diff --git proto/rpki/config.Y proto/rpki/config.Y index c28cab7a..31656057 100644 --- proto/rpki/config.Y +++ proto/rpki/config.Y @@ -32,7 +32,7 @@ rpki_check_unused_transport(void) CF_DECLS CF_KEYWORDS(RPKI, REMOTE, BIRD, PRIVATE, PUBLIC, KEY, TCP, SSH, TRANSPORT, USER, - RETRY, REFRESH, EXPIRE, KEEP, IGNORE, MAX, LENGTH) + RETRY, REFRESH, EXPIRE, KEEP, IGNORE, MAX, LENGTH, SOURCE, ADDRESS) %type rpki_keep_interval @@ -60,6 +60,7 @@ rpki_proto_item: | REMOTE rpki_cache_addr | REMOTE rpki_cache_addr rpki_proto_item_port | rpki_proto_item_port + | SOURCE ADDRESS ipa { RPKI_CFG->local_ip = $3; } | TRANSPORT rpki_transport | REFRESH rpki_keep_interval expr { if (rpki_check_refresh_interval($3)) diff --git proto/rpki/rpki.h proto/rpki/rpki.h index 8a5c38fd..e67eb0e3 100644 --- proto/rpki/rpki.h +++ proto/rpki/rpki.h @@ -116,6 +116,7 @@ struct rpki_proto { struct rpki_config { struct proto_config c; const char *hostname;/* Full domain name or stringified IP address of cache server */ + ip_addr local_ip;/* Source address to use */ ip_addr ip; /* IP address of cache server or IPA_NONE */ u16 port;/* Port number of cache server */ struct rpki_tr_config tr_config; /* Specific transport configuration structure */ diff --git proto/rpki/transport.c proto/rpki/transport.c index 81bd6dd8..26571977 100644 --- proto/rpki/transport.c +++ proto/rpki/transport.c @@ -82,6 +82,7 @@ rpki_tr_open(struct rpki_tr_sock *tr) sk->daddr = cf->ip; sk->dport = cf->port; sk->host = cf->hostname; + sk->saddr = cf->local_ip; sk->rbsize = RPKI_RX_BUFFER_SIZE; sk->tbsize = RPKI_TX_BUFFER_SIZE; sk->tos = IP_PREC_INTERNET_CONTROL;
[patch] SendHoldTimer BGP Error code
Dear all, IANA registered an "Early Allocation" BGP Error code for draft-ietf-idr-bgp-sendholdtimer, see https://www.iana.org/assignments/bgp-parameters/bgp-parameters.xhtml#bgp-parameters-3 The below changeset aligns bird with IANA's Border Gateway Protocol (BGP) Parameters registry. Kind regards, Job diff --git proto/bgp/bgp.c proto/bgp/bgp.c index e97b45e6..bd6e90d6 100644 --- proto/bgp/bgp.c +++ proto/bgp/bgp.c @@ -1059,12 +1059,13 @@ bgp_send_hold_timeout(timer *t) struct bgp_conn *conn = t->data; struct bgp_proto *p = conn->bgp; + DBG("BGP: Send hold timeout\n"); + if (conn->state == BS_CLOSE) return; - /* Error codes not yet assigned by IANA */ - uint code = 4; - uint subcode = 1; + uint code = 8; + uint subcode = 0; /* Like bgp_error() but without NOTIFICATION */ bgp_log_error(p, BE_BGP_TX, "Error", code, subcode, NULL, 0); diff --git proto/bgp/packets.c proto/bgp/packets.c index 18a226fb..8320991a 100644 --- proto/bgp/packets.c +++ proto/bgp/packets.c @@ -3275,7 +3275,6 @@ static struct { { 3, 10, "Invalid network field" }, { 3, 11, "Malformed AS_PATH" }, { 4, 0, "Hold timer expired" }, - { 4, 1, "Send hold timer expired" }, /* Provisional [draft-ietf-idr-bgp-sendholdtimer] */ { 5, 0, "Finite state machine error" }, /* Subcodes are according to [RFC6608] */ { 5, 1, "Unexpected message in OpenSent state" }, { 5, 2, "Unexpected message in OpenConfirm state" }, @@ -3290,7 +3289,8 @@ static struct { { 6, 7, "Connection collision resolution" }, { 6, 8, "Out of Resources" }, { 7, 0, "Invalid ROUTE-REFRESH message" }, /* [RFC7313] */ - { 7, 1, "Invalid ROUTE-REFRESH message length" } /* [RFC7313] */ + { 7, 1, "Invalid ROUTE-REFRESH message length" }, /* [RFC7313] */ + { 8, 0, "Send hold timer expired" } /* [draft-ietf-idr-bgp-sendholdtimer] */ }; /**
Re: Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)
On Thu, Jan 25, 2024 at 11:55:14AM +0100, Erin Shepherd wrote: > Spitballing slightly here, but could you avoid this problem by adding > 0.0.0.0/0+ ::0/0+ AS0 RoAs to the table and accepting ROA Unknowns? > > Obviously the disadvantage here is that if your IRR RTR server goes > down you're basically unfiltered, but it at least avoids the > availability problem That's an interesting approach, thanks for sharing this. Kind regards, Job
Re: Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)
On Thu, Jan 25, 2024 at 11:41:25AM +0100, Job Snijders wrote: > On Thu, Jan 25, 2024 at 11:13:51AM +0100, Jeroen Massar via Bird-users wrote: > > a quick stab at generating the slurm file: > > why use SLURM though? SLURM doesn't have a 'maxLength' field like the > regular JSON input formatted in this style has: > https://console.rpki-client.org/rpki.json - which might help with > aggregation. Err... scratch that, SLURM does have a field for that, but it's called "maxPrefixLength". Anyway, using the 'normal' JSON format as input for the IRR RTR server should make for a more concise smaller JSON file compared to using SLURM prefixAssertions, but both could be used. Kind regards, Job
Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)
On Thu, Jan 25, 2024 at 11:13:51AM +0100, Jeroen Massar via Bird-users wrote: > a quick stab at generating the slurm file: why use SLURM though? SLURM doesn't have a 'maxLength' field like the regular JSON input formatted in this style has: https://console.rpki-client.org/rpki.json - which might help with aggregation. More importantly, a risk I perceive with overloading RTR functionality to load IRR data into routers is in the realm of fail-safes: For RPKI-derived data, most ISPs do something along the lines of: if (roa_check(rpki, net, bgp_path.last) = ROA_INVALID) then reject; For IRR-derived data, you'd have to do: if (roa_check(irr, net, peerasn) != ROA_VALID) then reject; The above means that suddenly your EBGP routers/route servers have a very hard dependency on the IRR RTR session being up in order to accept routes (fail closed), whereas how the RPKI-derived data is used is in a 'fail open' fashion. The above friction goes back to RPKI ROAs being defined as "if ROAs exist, all BGP routes that do not match any of the ROAs are invalid" (following the RFC 6811 algorithm), but for IRR route/route6 objects such a definition was never established, those predate the RFC 6811 algorithm. Kind regards, Job
Re: Expiration for ROA tables until when the VRP is valid?
On Tue, Mar 07, 2023 at 01:00:40PM +0100, Job Snijders wrote: > On Tue, Mar 07, 2023 at 12:52:16PM +0100, Ondrej Zajicek wrote: > > If i understand it correctly, it is relevant just for static ROA > > records? > > Correct > > > I assume these expiration records are based on wall-clock time instead > > of relative time? > > Correct, wall-clock time expressed as number of seconds that have > elapsed since 00:00:00 UTC on 1 January 1970 (Unix time). > > > It is a question whether we should handle expiration of such static routes > > properly / dynamically, or just a one-time check during reconfiguration. > > That would be order of magnitude simpler, but it is also a thing that > > could be done by a trivial script preprocessing the included config file > > with static ROA records. > > For what its worth: OpenBGPD and StayRTR handle it 'dynamically', but > not in absolute real time: both implementations walk a table every few > (~3) minutes to check for newly expired entries. To follow up with a mock configuration example, please see below. The %lld after keyword 'expires' is calculated by OpenBSD's rpki-client by walking the chain of authorities and keeping tabs on the 'soonest' expiration moment. In other words, the 'expires' value is the moment a RTR server would send an IPv4/IPv6 Prefix RTR PDU with the 'withdraw' flag set. (rpki-client supports generating static ROA configurations in BIRD's format, if BIRD supports 'expires' I'll update rpki-client/output-bird.c) For a constantly up-to-date set of data, please download https://console.rpki-client.org/vrps.json (as often as you want), and transform it into BIRD's configuration format. The 'expires' key in the JSON objects is what one would use to instruct the BGP daemon or RTR server when to expire. I am happy to help with this project. Kind regards, Job roa4 table ROAS4; roa6 table ROAS6; protocol static { roa4 { table ROAS4; }; route 1.0.0.0/24 max 24 as 13335 expires 1678717104; # Mon Mar 13 14:18:24 UTC 2023 route 1.0.4.0/24 max 24 as 38803 expires 1678751858; # Mon Mar 13 23:57:38 UTC 2023 route 1.7.142.0/24 max 24 as 132215 expires 1678789300; # Tue Mar 14 10:21:40 UTC 2023 # ... # left the thousands of other static entries out for brevity } protocol static { roa6 { table ROAS6; }; route 2001:200::/32 max 32 as 2500 expires 1678724530; route 2001:200:136::/48 max 48 as 9367 expires 1678724530; route 2001:200:1ba::/48 max 48 as 24047 expires 1678724530; # ... # left the thousands of other static entries out for brevity }
Re: Expiration for ROA tables until when the VRP is valid?
On Tue, Mar 07, 2023 at 12:52:16PM +0100, Ondrej Zajicek wrote: > If i understand it correctly, it is relevant just for static ROA > records? Correct > I assume these expiration records are based on wall-clock time instead > of relative time? Correct, wall-clock time expressed as number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970 (Unix time). > It is a question whether we should handle expiration of such static routes > properly / dynamically, or just a one-time check during reconfiguration. > That would be order of magnitude simpler, but it is also a thing that > could be done by a trivial script preprocessing the included config file > with static ROA records. For what its worth: OpenBGPD and StayRTR handle it 'dynamically', but not in absolute real time: both implementations walk a table every few (~3) minutes to check for newly expired entries. Kind regards, Job
Re: Expiration for ROA tables until when the VRP is valid?
On Tue, Mar 07, 2023 at 01:01:36AM +0100, Robert Scheck wrote: > On Sun, 19 Sep 2021, Robert Scheck wrote: > > rpki-client recently implemented the "expires" instruction for roa-sets > > that OpenBGPD provides [1][2]. As of writing, BIRD does not seem to have > > something similar...any chance for it? From my understanding this only > > applies to included ROA files with VRP, not to RTR. > > > > [1] https://man.openbsd.org/bgpd.conf#roa-set > > [2] > > https://github.com/rpki-client/rpki-client-openbsd/commit/7bf63da6ec80f37bd72dbab99a5a71cee5707dc2 > > Please let me kindly repeat my initial question from about 1.5 years ago: > Is there any chance for getting this feature into BIRD, too? Job provided > some more details and insights as part of the original thread: > > - https://bird.network.cz/pipermail/bird-users/2021-September/015725.html > - https://bird.network.cz/pipermail/bird-users/2021-September/015726.html Related, RPKI-To-Router implementation StayRTR recently received support for honoring configured expiration timers for individual RPKI VRPs. [1] When the expiration moment (noted as a unix timestamp) of a given RPKI ROA/VRP has arrived, the StayRTR daemon will emit RTR Withdraws towards its clients for that ROA/VRP. Indeed, OpenBGPD (when loading VRPs from on-disk configuration) also supports similar functionality, which has proven to make various deployment scenarios less prone to faults in configuration pipelines. I'd love it if BIRD also allows operators to specify the expiration moment of a given ROA/VRP in the on-disk configuration through a keyword + timestamp. Kind regards, Job [1]: https://github.com/bgp/stayrtr/commit/13659dd27e1b792dd2a7b9f439ef0a4159d862d9