Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Job Snijders via Bird-users
On Thu, May 16, 2024 at 12:55:38PM +0200, Ondrej Zajicek wrote:
> Yeah, i think that using TCP_NODELAY for BGP makes sense, considering
> there is already non-trivial framing and we write individual BGP
> messages with one write().

I suspect it might be better for throughput to send multiple BGP
messages with a single writev() or sendmsg() call.

Kind regards,

Job


Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
On Wed, May 15, 2024 at 09:09:47PM +0200, Erin Shepherd wrote:
> It seems absent from the BSDs, but on Linux you can pass the MSG_MORE
> flag to send() to override TCP_NODELAY for a specific write

Am I understanding correctly this is a variant on TCP_NOPUSH/TCP_CORK?
"more data is coming, dont push the send button yet!"

In OpenBGPD, TCP_NODELAY is set on the socket (a socket option available
on all platforms, I think?), and then all data is coalesced into
sendmsg(), no need for 'corking'. From my limited testing it seems a
full routing table should fit in ~ TCP 41,000 packets.

BIRD has a code path sk_sendmsg()->sendmsg() called from
sk_maybe_write(); but based my limited testing I'm not sure this path is
followed in all cases, because I see way more than 41K packets for a
full table feed (with TCP_NODELAY enabled).

Perhaps there are two separate questions here:

- are BGP messages (slightly) delayed because of TCP_NODELAY not being
  set? (I think yes)
- are BGP messages as efficiently coalesced into as few TCP packets as
  possible? (with TCP_NODELAY set, I am not sure)

Kind regards,

Job

ps. To clarify why I started this thread: last week I fell into the TCP
subsystem rabbit hole: why are things the way they are? I started
auditing various programs related to my $dayjob and thought it would be
good to open a conversation with the BIRD developer community. My goal
is not necessarily to get this patch 'as-is' merged, but to learn from
and with friendly and respected BGP developers.


Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
Dear Marco,

On Wed, 15 May 2024 at 19:27, Marco d'Itri  wrote:

> On May 15, Job Snijders via Bird-users  wrote:
>
> > But please be very careful in considering this patch, because it does
> > introduce some subtle changes in the on-the-wire behaviour of BIRD. For
> > example, without this patch an UPDATE for a handful of routes and the
> > End-of-RIB marker might end up in the same TCP packet (if this fits);
> > but with this patch, the End-of-RIB marker ends up in its own TCP
> > packet. As things are today, setting TCP_NODELAY will increase the



> I think that this can be fixed easily with TCP_CORK.
> Basically, with TCP_NODELAY + TCP_CORK you can have the optimal balance
> of latency and overhead without using writev.



Yes, thanks for sharing this suggestion.

Note that TCP_CORK is a Linux-only feature (on FreeBSD it seems aliased to
TCP_NOPUSH, but I might be misunderstanding that code). There are subtle
differences between NOPUSH and CORK: resetting CORK triggers a flush,
whereas resetting NOPUSH does not.

It is of course reasonable to optimize one platform and not others, we work
with the tools that we have - but a portable approach (using writev()?)
would seem attractive to me :-)

Kind regards,

Job

>


use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
Dear BIRD people,

On most systems RFC 896 TCP congestion control is used, also known as
"Nagle's algorithm". This algorithm is intended to help coalesce
consecutive small packets from userland applications (like BIRD) into a
single larger TCP packet. The idea being it reduces bandwidth because
there is less TCP overhead if data is bundled into fewer packets.

This happens at the cost of an increase in end-to-end latency: the
sender will locally queue up data (in the operating system's TCP
buffers) until it either receives a TCP Ack from the remote side for all
previously sent data, or sufficient additional data piled up to send out
a full-sized TCP segment.

In my simple testing setup (announce 255 routes to a peer) with this
patch it takes ~ 0.007478 seconds after SYN to send the UPDATE message
out. Without this patch, it takes ~ 0.206955 seconds to send the UPDATE
out (perhaps because of negative interaction between Nagle's algorithm
and Delayed Acks?).

I think using TCP_NODELAY is interesting to consider, because it seems
sensible to try to deliver BGP messages as fast as possible. OpenBGPD
and FRR set the TCP_NODELAY socket option.

But please be very careful in considering this patch, because it does
introduce some subtle changes in the on-the-wire behaviour of BIRD. For
example, without this patch an UPDATE for a handful of routes and the
End-of-RIB marker might end up in the same TCP packet (if this fits);
but with this patch, the End-of-RIB marker ends up in its own TCP
packet. As things are today, setting TCP_NODELAY will increase the
average number of packets send to peers. Packing multiple BGP messages
into fewer TCP frames could perhaps be reintroduced by using writev()
instead of write(), when it is known a bunch of messages have been
queued up for a peer.

Another complex behaviour is that the effect of the Nagle algorithm
depends on latency between local system and remote system. Nagle's
algorithm makes it so that data is locally buffered as long as the
remote side has not yet acked all previously sent data; this means that
proportional to the distance (latency) chances of message coalescence
increase. Without this patch BIRD will send more individual TCP packets
to peers that are close by compared to peers that are far away. But with
TCP_NODELAY applied, all peers will receive the same number of TCP
packets, regardless of latency.

Look forward to hear your thoughts!

Kind regards,

Job

 sysdep/unix/io.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c
index 9b499020..f9746cb7 100644
--- a/sysdep/unix/io.c
+++ b/sysdep/unix/io.c
@@ -946,7 +946,7 @@ sock_new(pool *p)
 static int
 sk_setup(sock *s)
 {
-  int y = 1;
+  int y = 1, nodelay = 1;
   int fd = s->fd;
 
   if (s->type == SK_SSH_ACTIVE)
@@ -1048,6 +1048,10 @@ sk_setup(sock *s)
return -1;
   }
 
+  if ((s->type == SK_TCP_PASSIVE) || (s->type == SK_TCP_ACTIVE))
+if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, , sizeof(nodelay)) < 
0)
+  ERR("TCP_NODELAY");
+
   /* Must be after sk_set_tos4() as setting ToS on Linux also mangles priority 
*/
   if (s->priority >= 0)
 if (sk_set_priority(s, s->priority) < 0)


Re: Large communities indicating RPKI VALID status

2024-04-29 Thread Job Snijders via Bird-users
On Mon, 29 Apr 2024 at 21:27, Nigel Kukard via Bird-users <
bird-users@network.cz> wrote:

> Hi there Richard,
>
> On 4/29/24 19:14, Richard Laager wrote:
>
> Perhaps I am naive, but I assumed one would validate RPKI on the eBGP edge 
> and simply reject INVALID routes.
>
> Why would one want to accept INVALID at all?
>
> If we agree one would reject INVALID, then what is left to tag?
>
> For my specific use case I wanted to add a community for VALID and
> UNKNOWN. I'm going to look into the non-transitive extended communities to
> see how this works out.
>


Sure, but why add such communities? It reduces performance and doesn’t add
security benefits.

OTOH - it can satisfy curiosity about where traffic is flowing - then
again, using a traffic analyser like pmacct or Kentik helps offer insight
how much traffic is going to Valid vs Not-Found destinations, without the
need to add any communities.

I’m not saying you shouldn’t pursue adding a few non-transitive extended
communities here and there for your use case; just that generally speaking,
operators probably should not apply different policies for Valid and
Not-Found states.

Kind regards,

Job

>


Re: Large communities indicating RPKI VALID status

2024-04-28 Thread Job Snijders via Bird-users
On Sat, Apr 27, 2024 at 03:00:45PM +0200, Ondrej Zajicek via Bird-users wrote:
> On Sat, Apr 27, 2024 at 08:18:18AM +0200, Daniel Suchy via Bird-users wrote:
> > There's internet draft describing in detail, why it's not a good idea to
> > store RPKI validation state inside community variables at all..
> > 
> > https://www.ietf.org/archive/id/draft-ietf-sidrops-avoid-rpki-state-in-bgp-00.html
> 
> Well, note that this draft is primarily about not announcing validation
> state in transitive attributes to the whole internet.

Yes

> But there are good reasons for having validation state in
> non-transitive BGP attributes (or communities properly filtered out on
> AS egress). Validating RPKI and marking routes at AS ingress ensures
> that all routers within AS have consistent routing state and thus
> avoiding routing loops.

Perhaps I am missing something, but how does marking in itself help
avoid routing loops?

I am under the impression that a requirement for intra-AS transitive
marking follows from non-standard modifying of non-transitive local
attributes (for example, 'weight' or 'preference') based on the
validation state - which is not what I'd recommend to begin with.

Merely rejecting RPKI-invalid routes at the AS ingress and *not*
manipulating any other local or transitive path attributes will also
ensure a consistent routing state.

> Unfortunately large communities do not have transitive flag like
> extended ones
> so perhaps it would make sense to use extended community for this
> purpose. Or perhaps there should be well-known extended community for
> that ...

https://www.rfc-editor.org/rfc/rfc8097.html ?

Kind regards,

Job


Re: [patch] add 'source address' configuration option to RPKI protocols

2024-02-22 Thread Job Snijders via Bird-users
On Thu, Feb 22, 2024 at 03:17:52PM +0100, Ondrej Zajicek wrote:
> On Wed, Feb 21, 2024 at 07:14:18PM +0100, Job Snijders via Bird-users wrote:
> > I'd like to be able to explicitly configure the source IP address
> > for RPKI-To-Router sessions. Predictable source addresses are useful
> > for minimizing the holes to be poked in ACLs. The below changeset
> > adds a 'source address' configuration option to RPKI protocols.
> 
> Thanks, merged:
> 
> https://gitlab.nic.cz/labs/bird/-/commit/e2728c8078161d9811d6c24a11e4c95efd1c9313
> 
> I just changed the name from 'source address' to 'local address'.

Thank you!

And kudos to the project, it was nice to observe how the pre-existing
infrastructure (like bird sockets) made this is a trivial excercise. :)

Kind regards,

Job


[patch] add 'source address' configuration option to RPKI protocols

2024-02-21 Thread Job Snijders via Bird-users
Dear BIRD team,

Greetings from Amsterdam!

I'd like to be able to explicitly configure the source IP address for
RPKI-To-Router sessions. Predictable source addresses are useful for
minimizing the holes to be poked in ACLs. The below changeset adds a
'source address' configuration option to RPKI protocols.

Kind regards,

Job

diff --git doc/bird.sgml doc/bird.sgml
index 76ca7f75..a271d47e 100644
--- doc/bird.sgml
+++ doc/bird.sgml
@@ -5700,6 +5700,7 @@ protocol rpki [name] {
 refresh [keep] num;
 retry [keep] num;
 expire [keep] num;
+source address ip;
 transport tcp;
 transport ssh {
 bird private key "/path/to/id_rsa";
@@ -5753,6 +5754,9 @@ specify both channels.
instead. This may be useful for implementing loose RPKI check for
blackholes. Default: disabled.
 
+source address 
+Define local address we should use as a source address for the RTR 
session.
+
 transport tcp Unprotected transport over TCP. It's a default
 transport. Should be used only on secure private networks.
 Default: tcp
diff --git proto/rpki/config.Y proto/rpki/config.Y
index c28cab7a..31656057 100644
--- proto/rpki/config.Y
+++ proto/rpki/config.Y
@@ -32,7 +32,7 @@ rpki_check_unused_transport(void)
 CF_DECLS
 
 CF_KEYWORDS(RPKI, REMOTE, BIRD, PRIVATE, PUBLIC, KEY, TCP, SSH, TRANSPORT, 
USER,
-   RETRY, REFRESH, EXPIRE, KEEP, IGNORE, MAX, LENGTH)
+   RETRY, REFRESH, EXPIRE, KEEP, IGNORE, MAX, LENGTH, SOURCE, ADDRESS)
 
 %type  rpki_keep_interval
 
@@ -60,6 +60,7 @@ rpki_proto_item:
  | REMOTE rpki_cache_addr
  | REMOTE rpki_cache_addr rpki_proto_item_port
  | rpki_proto_item_port
+ | SOURCE ADDRESS ipa { RPKI_CFG->local_ip = $3; }
  | TRANSPORT rpki_transport
  | REFRESH rpki_keep_interval expr {
  if (rpki_check_refresh_interval($3))
diff --git proto/rpki/rpki.h proto/rpki/rpki.h
index 8a5c38fd..e67eb0e3 100644
--- proto/rpki/rpki.h
+++ proto/rpki/rpki.h
@@ -116,6 +116,7 @@ struct rpki_proto {
 struct rpki_config {
   struct proto_config c;
   const char *hostname;/* Full domain name or 
stringified IP address of cache server */
+  ip_addr local_ip;/* Source address to use */
   ip_addr ip;  /* IP address of cache server or 
IPA_NONE */
   u16 port;/* Port number of cache server */
   struct rpki_tr_config tr_config; /* Specific transport configuration 
structure */
diff --git proto/rpki/transport.c proto/rpki/transport.c
index 81bd6dd8..26571977 100644
--- proto/rpki/transport.c
+++ proto/rpki/transport.c
@@ -82,6 +82,7 @@ rpki_tr_open(struct rpki_tr_sock *tr)
   sk->daddr = cf->ip;
   sk->dport = cf->port;
   sk->host = cf->hostname;
+  sk->saddr = cf->local_ip;
   sk->rbsize = RPKI_RX_BUFFER_SIZE;
   sk->tbsize = RPKI_TX_BUFFER_SIZE;
   sk->tos = IP_PREC_INTERNET_CONTROL;


[patch] SendHoldTimer BGP Error code

2024-02-17 Thread Job Snijders via Bird-users
Dear all,

IANA registered an "Early Allocation" BGP Error code for
draft-ietf-idr-bgp-sendholdtimer, see
https://www.iana.org/assignments/bgp-parameters/bgp-parameters.xhtml#bgp-parameters-3

The below changeset aligns bird with IANA's Border Gateway Protocol
(BGP) Parameters registry.

Kind regards,

Job

diff --git proto/bgp/bgp.c proto/bgp/bgp.c
index e97b45e6..bd6e90d6 100644
--- proto/bgp/bgp.c
+++ proto/bgp/bgp.c
@@ -1059,12 +1059,13 @@ bgp_send_hold_timeout(timer *t)
   struct bgp_conn *conn = t->data;
   struct bgp_proto *p = conn->bgp;
 
+  DBG("BGP: Send hold timeout\n");
+
   if (conn->state == BS_CLOSE)
 return;
 
-  /* Error codes not yet assigned by IANA */
-  uint code = 4;
-  uint subcode = 1;
+  uint code = 8;
+  uint subcode = 0;
 
   /* Like bgp_error() but without NOTIFICATION */
   bgp_log_error(p, BE_BGP_TX, "Error", code, subcode, NULL, 0);
diff --git proto/bgp/packets.c proto/bgp/packets.c
index 18a226fb..8320991a 100644
--- proto/bgp/packets.c
+++ proto/bgp/packets.c
@@ -3275,7 +3275,6 @@ static struct {
   { 3, 10, "Invalid network field" },
   { 3, 11, "Malformed AS_PATH" },
   { 4, 0, "Hold timer expired" },
-  { 4, 1, "Send hold timer expired" }, /* Provisional 
[draft-ietf-idr-bgp-sendholdtimer] */
   { 5, 0, "Finite state machine error" }, /* Subcodes are according to 
[RFC6608] */
   { 5, 1, "Unexpected message in OpenSent state" },
   { 5, 2, "Unexpected message in OpenConfirm state" },
@@ -3290,7 +3289,8 @@ static struct {
   { 6, 7, "Connection collision resolution" },
   { 6, 8, "Out of Resources" },
   { 7, 0, "Invalid ROUTE-REFRESH message" }, /* [RFC7313] */
-  { 7, 1, "Invalid ROUTE-REFRESH message length" } /* [RFC7313] */
+  { 7, 1, "Invalid ROUTE-REFRESH message length" }, /* [RFC7313] */
+  { 8, 0, "Send hold timer expired" } /* [draft-ietf-idr-bgp-sendholdtimer] */
 };
 
 /**


Re: Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)

2024-01-25 Thread Job Snijders via Bird-users
On Thu, Jan 25, 2024 at 11:55:14AM +0100, Erin Shepherd wrote:
> Spitballing slightly here, but could you avoid this problem by adding
> 0.0.0.0/0+ ::0/0+ AS0 RoAs to the table and accepting ROA Unknowns?
> 
> Obviously the disadvantage here is that if your IRR RTR server goes
> down you're basically unfiltered, but it at least avoids the
> availability problem

That's an interesting approach, thanks for sharing this.

Kind regards,

Job


Re: Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)

2024-01-25 Thread Job Snijders via Bird-users
On Thu, Jan 25, 2024 at 11:41:25AM +0100, Job Snijders wrote:
> On Thu, Jan 25, 2024 at 11:13:51AM +0100, Jeroen Massar via Bird-users wrote:
> > a quick stab at generating the slurm file:
> 
> why use SLURM though? SLURM doesn't have a 'maxLength' field like the
> regular JSON input formatted in this style has:
> https://console.rpki-client.org/rpki.json - which might help with
> aggregation.

Err... scratch that, SLURM does have a field for that, but it's called
"maxPrefixLength". Anyway, using the 'normal' JSON format as input for
the IRR RTR server should make for a more concise smaller JSON file
compared to using SLURM prefixAssertions, but both could be used.

Kind regards,

Job


Overloading RTR to load IRR (Was: Defines for mixed IPv6/IPv4)

2024-01-25 Thread Job Snijders via Bird-users
On Thu, Jan 25, 2024 at 11:13:51AM +0100, Jeroen Massar via Bird-users wrote:
> a quick stab at generating the slurm file:

why use SLURM though? SLURM doesn't have a 'maxLength' field like the
regular JSON input formatted in this style has:
https://console.rpki-client.org/rpki.json - which might help with
aggregation.

More importantly, a risk I perceive with overloading RTR functionality
to load IRR data into routers is in the realm of fail-safes:

For RPKI-derived data, most ISPs do something along the lines of:

   if (roa_check(rpki, net, bgp_path.last) = ROA_INVALID) then reject;

For IRR-derived data, you'd have to do:

   if (roa_check(irr, net, peerasn) != ROA_VALID) then reject;

The above means that suddenly your EBGP routers/route servers have a
very hard dependency on the IRR RTR session being up in order to accept
routes (fail closed), whereas how the RPKI-derived data is used is in a
'fail open' fashion.

The above friction goes back to RPKI ROAs being defined as "if ROAs
exist, all BGP routes that do not match any of the ROAs are invalid"
(following the RFC 6811 algorithm), but for IRR route/route6 objects
such a definition was never established, those predate the RFC 6811
algorithm.

Kind regards,

Job


Re: Expiration for ROA tables until when the VRP is valid?

2023-03-07 Thread Job Snijders via Bird-users
On Tue, Mar 07, 2023 at 01:00:40PM +0100, Job Snijders wrote:
> On Tue, Mar 07, 2023 at 12:52:16PM +0100, Ondrej Zajicek wrote:
> > If i understand it correctly, it is relevant just for static ROA
> > records?
> 
> Correct
> 
> > I assume these expiration records are based on wall-clock time instead
> > of relative time?
> 
> Correct, wall-clock time expressed as number of seconds that have
> elapsed since 00:00:00 UTC on 1 January 1970 (Unix time).
> 
> > It is a question whether we should handle expiration of such static routes
> > properly / dynamically, or just a one-time check during reconfiguration.
> > That would be order of magnitude simpler, but it is also a thing that
> > could be done by a trivial script preprocessing the included config file
> > with static ROA records.
> 
> For what its worth: OpenBGPD and StayRTR handle it 'dynamically', but
> not in absolute real time: both implementations walk a table every few
> (~3) minutes to check for newly expired entries.

To follow up with a mock configuration example, please see below.

The %lld after keyword 'expires' is calculated by OpenBSD's rpki-client
by walking the chain of authorities and keeping tabs on the 'soonest'
expiration moment. In other words, the 'expires' value is the moment a
RTR server would send an IPv4/IPv6 Prefix RTR PDU with the 'withdraw'
flag set.

(rpki-client supports generating static ROA configurations in BIRD's
format, if BIRD supports 'expires' I'll update rpki-client/output-bird.c) 

For a constantly up-to-date set of data, please download
https://console.rpki-client.org/vrps.json (as often as you want), and
transform it into BIRD's configuration format. The 'expires' key in the
JSON objects is what one would use to instruct the BGP daemon or RTR
server when to expire.

I am happy to help with this project.

Kind regards,

Job



roa4 table ROAS4;
roa6 table ROAS6;

protocol static {
  roa4 { table ROAS4; };
  route 1.0.0.0/24 max 24 as 13335 expires 1678717104; # Mon Mar 13 14:18:24 
UTC 2023
  route 1.0.4.0/24 max 24 as 38803 expires 1678751858; # Mon Mar 13 23:57:38 
UTC 2023
  route 1.7.142.0/24 max 24 as 132215 expires 1678789300; # Tue Mar 14 10:21:40 
UTC 2023
  # ...
  # left the thousands of other static entries out for brevity
}

protocol static {
  roa6 { table ROAS6; };
  route 2001:200::/32 max 32 as 2500 expires 1678724530;
  route 2001:200:136::/48 max 48 as 9367 expires 1678724530;
  route 2001:200:1ba::/48 max 48 as 24047 expires 1678724530;
  # ...
  # left the thousands of other static entries out for brevity
}


Re: Expiration for ROA tables until when the VRP is valid?

2023-03-07 Thread Job Snijders via Bird-users
On Tue, Mar 07, 2023 at 12:52:16PM +0100, Ondrej Zajicek wrote:
> If i understand it correctly, it is relevant just for static ROA
> records?

Correct

> I assume these expiration records are based on wall-clock time instead
> of relative time?

Correct, wall-clock time expressed as number of seconds that have
elapsed since 00:00:00 UTC on 1 January 1970 (Unix time).

> It is a question whether we should handle expiration of such static routes
> properly / dynamically, or just a one-time check during reconfiguration.
> That would be order of magnitude simpler, but it is also a thing that
> could be done by a trivial script preprocessing the included config file
> with static ROA records.

For what its worth: OpenBGPD and StayRTR handle it 'dynamically', but
not in absolute real time: both implementations walk a table every few
(~3) minutes to check for newly expired entries.

Kind regards,

Job


Re: Expiration for ROA tables until when the VRP is valid?

2023-03-06 Thread Job Snijders via Bird-users
On Tue, Mar 07, 2023 at 01:01:36AM +0100, Robert Scheck wrote:
> On Sun, 19 Sep 2021, Robert Scheck wrote:
> > rpki-client recently implemented the "expires" instruction for roa-sets
> > that OpenBGPD provides [1][2]. As of writing, BIRD does not seem to have
> > something similar...any chance for it? From my understanding this only
> > applies to included ROA files with VRP, not to RTR.
> > 
> > [1] https://man.openbsd.org/bgpd.conf#roa-set
> > [2] 
> > https://github.com/rpki-client/rpki-client-openbsd/commit/7bf63da6ec80f37bd72dbab99a5a71cee5707dc2
> 
> Please let me kindly repeat my initial question from about 1.5 years ago:
> Is there any chance for getting this feature into BIRD, too? Job provided
> some more details and insights as part of the original thread:
> 
>  - https://bird.network.cz/pipermail/bird-users/2021-September/015725.html
>  - https://bird.network.cz/pipermail/bird-users/2021-September/015726.html

Related, RPKI-To-Router implementation StayRTR recently received support
for honoring configured expiration timers for individual RPKI VRPs. [1]
When the expiration moment (noted as a unix timestamp) of a given RPKI
ROA/VRP has arrived, the StayRTR daemon will emit RTR Withdraws towards its
clients for that ROA/VRP.

Indeed, OpenBGPD (when loading VRPs from on-disk configuration) also
supports similar functionality, which has proven to make various
deployment scenarios less prone to faults in configuration pipelines.

I'd love it if BIRD also allows operators to specify the expiration
moment of a given ROA/VRP in the on-disk configuration through a
keyword + timestamp.

Kind regards,

Job

[1]: 
https://github.com/bgp/stayrtr/commit/13659dd27e1b792dd2a7b9f439ef0a4159d862d9