Ah, missed that. Thanks for the correction. Joost. On Thu, Oct 20, 2022 at 5:36 PM Bastien TEINTURIER <bast...@acinq.fr> wrote:
> Hi Joost, > > I need more time to review your proposed change, but I wanted to quickly > correct a misunderstanding you had in quoting eclair's code: > > > Unfortunately it is possible for nodes on the route to hide themselves. > > If they return random data as the failure message, the sender won't know > > where the failure happened. Some senders then penalize all nodes that > > were part of the route [4][5]. This may exclude perfectly reliable nodes > > from being used for future payments. > > Eclair's code does not penalize nodes for future payment attempts in this > case. It only ignores them for the retries of that particular payment. > > Cheers, > Bastien > > Le mer. 19 oct. 2022 à 13:13, Joost Jager <joost.ja...@gmail.com> a > écrit : > >> Hi list, >> >> I wanted to get back to a long-standing issue in Lightning: gaps in error >> attribution. I've posted about this before back in 2019 [1]. >> >> Error attribution is important to properly penalize nodes after a payment >> failure occurs. The goal of the penalty is to give the next attempt a >> better chance at succeeding. In the happy failure flow, the sender is able >> to determine the origin of the failure and penalizes a single node or pair >> of nodes. >> >> Unfortunately it is possible for nodes on the route to hide themselves. >> If they return random data as the failure message, the sender won't know >> where the failure happened. Some senders then penalize all nodes that were >> part of the route [4][5]. This may exclude perfectly reliable nodes from >> being used for future payments. Other senders penalize no nodes at all >> [6][7], which allows the offending node to keep the disruption going. >> >> A special case of this is a final node sending back random data. Senders >> that penalize all nodes will keep looking for alternative routes. But >> because each alternative route still ends with that same final node, the >> sender will ultimately penalize all of its peers and possibly a lot of the >> rest of the network too. >> >> I can think of various reasons for exploiting this weakness. One is just >> plain grievance for whatever reason. Another one is to attract more traffic >> by getting competing routing nodes penalized. Or the goal could be to >> sufficiently mess up reputation tracking of a specific sender node to make >> it hard for that node to make further payments. >> >> Related to this are delays in the path. A node can delay propagating back >> a failure message and the sender won't be able to determine which node did >> it. >> >> The link at the top of this post [1] describes a way to address both >> unreadable failure messages as well as delays by letting each node on the >> route append a timestamp and hmac to the failure message. The great >> challenge is to do this in such a way that nodes don’t learn their position >> in the path. >> >> I'm revisiting this idea, and have prototyped various ways to implement >> it. In the remainder of this post, I will describe the variant that I >> thought works best (so far). >> >> # Failure message format >> >> The basic idea of the new format is to let each node (not just the error >> source) commit to the failure message when it passes it back by adding an >> hmac. The sender verifies all hmacs upon receipt of the failure message. >> This makes it impossible for any of the nodes to modify the failure message >> without revealing that they might have played a part in the modification. >> It won’t be possible for the sender to pinpoint an exact node, because >> either end of a communication channel may have modified the message. >> Pinpointing a pair of nodes however is good enough, and is commonly done >> for regular onion failures too. >> >> On the highest level, the new failure message consists of three parts: >> >> `message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len) >> >> * `message` is the standard onion failure message as described in [2], >> but without the hmac. The hmac is now part of `hmacs` and doesn't need to >> be repeated. >> >> * `payloads` is a fixed length array that contains space for each node >> (`hop_payload`) on the route to add data to return to the sender. Ideally >> the contents and size of `hop_payload` is signaled so that future >> extensions don’t require all nodes to upgrade. For now, we’ll assume the >> following 9-byte format: >> >> `is_final` (1 byte) | `duration` (8 bytes) >> >> `is_final` indicates whether this node is the failure source. The >> sender uses `is_final` to determine when to stop the >> decryption/verification process. >> >> `duration` is the time in milliseconds that the node held the htlc. By >> observing the series of reported durations, the sender is able to pinpoint >> a delay down to a pair of nodes. >> >> The `hop_payload` is repeated 27 times (the maximum route length). >> >> Every hop shifts `payloads` 9 bytes to the right and puts its own >> `hop_payload` in the 9 left-most bytes. >> >> * `hmacs` is a fixed length array where nodes add their hmacs as the >> failure message travels back to the sender. >> >> To keep things simple, I'll describe the format as if the maximum route >> length was only three hops (instead of 27): >> >> `hmac_0_2` | `hmac_0_1`| `hmac_0_0`| `hmac_1_1`| `hmac_1_0`| `hmac_2_0` >> >> Because nodes don't know their position in the path, it's unclear to >> them what part of the failure message they are supposed to include in the >> hmac. They can't just include everything, because if part of that data is >> deleted later (to keep the message size fixed) it opens up the possibility >> for nodes to blame others. >> >> The solution here is to provide hmacs for all possible positions. The >> last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0` >> to the block. Each hmac corresponds to a presumed position in the path, >> where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0` >> for the shortest (node is the error source). >> >> `hmac_x_y` is the hmac added by node x (counted from the node that is >> currently handling the failure message) assuming that this node is y hops >> away from the final node. >> >> Before an hop adds its hmacs, it first deletes some of the previous >> hmacs. This keeps the failure message at a fixed length. The removed hmacs >> are the ones that cannot be useful anymore. If node 0 adds itself, the >> former node 0 (now node 1) cannot be at the first position anymore. The >> former node 1 (now node 2) cannot be at the second position anymore. The >> former node 2 cannot be the source of the error anymore and isn’t >> represented in the failure message any longer. The corresponding hmacs (the >> now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node >> 0. >> >> Deleting the useless data reduces the number of hmacs (and roughly the >> total failure message size) to half. >> >> The delete operation transform the fields above to: >> >> <empty> | <empty> | <empty> | `hmac_0_1`| `hmac_0_0`| `hmac_1_0` >> >> The exact data that is included in each hmac is: >> * `message` >> * the node’s own `hop_payload` and a set of downstream `hop_payload`s, >> depending on assumed position >> * a set of downstream node hmacs, depending on assumed position >> >> For example `hmac_0_1` is based on: >> >> `message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0` >> >> If the node that is currently handling the failure message is one hop >> away from the final node, it needs to cover its own `hop_payload[0]`, the >> final node `hop_payload[1]` and the final node hmac `hmac_1_0`. >> >> A longer path is committed to in `hmac_0_2`: >> >> `message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` | >> `hmac_1_1` | `hmac_2_0` >> >> The current node is two hops away from the final node. It needs to cover >> its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]` >> for the next and final hops. Additionally it covers the next hop `hmac_1_1` >> and final hop `hmac_2_0`, which correspond to the positions of those nodes >> in the path that is assumed for `hmac_0_2`. >> >> With this information, the sender is able to verify the longest chain of >> hmacs until it encounters a `hop_payload` with `is_final` set. >> >> If any of the nodes messes with any byte in the failure message, the >> sender is always able to determine a pair of nodes that the offending node >> is part of. This statement can be verified through reasoning, but to be >> sure I also tested it with code. I’ve simulated a malicious node that >> modifies a byte of the failure message at index x and observed the error >> source as determined by the sender. For every x, the sender reports the >> same correct pair. >> >> # Size >> >> The obvious downside of the scheme above is the size. Given a maximum of >> 27 hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes >> each. This makes for a total size of 12 KB. >> >> It could be the case though that it is not possible to devise a more >> compact scheme that also preserves the existing privacy guarantees. I know >> that smart people have spent time on this problem, but nonetheless no >> better solution has come up in the past years. A proof of its non-existence >> would be interesting for sure. >> >> I personally think the size increase is justified to fix this >> vulnerability in Lightning. Also if failures are expected to become more >> rare going forward, size becomes less relevant to the overall operation of >> the network. >> >> Another option is to reduce the maximum number of hops. It is >> questionable whether 27 hops are really needed in practice, and such long >> routes also contribute to latency and capital lock up. If for example the >> new failure message could only be used with routes up to 10 hops, the total >> number of hmacs would drop from 378 to 55. This makes for a total message >> size of about 2 KB. >> >> # Signaling >> >> For backwards compatibility nodes need to know what algorithm they should >> run to generate or transform the failure message. This can be signaled by >> the sender via a tlv onion field. A failure message format signaling >> mechanism is also discussed in the context of long failure messages [3]. >> The failure message described in this post could be just another version. >> >> Additionally, intermediate nodes need to advertise their capability to >> transform the new format through a feature bit. >> >> # Delayed successes >> >> It’s not just failures that can be delayed. Successes can too. In that >> case, there is no failure message to improve. It could be an option to add >> the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message. >> >> [1] >> https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html >> [2] >> https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors >> [3] https://github.com/lightning/bolts/pull/1021 >> [4] >> https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419 >> [5] >> >> https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221 >> [6] >> https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461 >> [7] >> https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682 >> >> _______________________________________________ >> Lightning-dev mailing list >> Lightning-dev@lists.linuxfoundation.org >> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev >> > On Thu, Oct 20, 2022 at 5:36 PM Bastien TEINTURIER <bast...@acinq.fr> wrote: > Hi Joost, > > I need more time to review your proposed change, but I wanted to quickly > correct a misunderstanding you had in quoting eclair's code: > > > Unfortunately it is possible for nodes on the route to hide themselves. > > If they return random data as the failure message, the sender won't know > > where the failure happened. Some senders then penalize all nodes that > > were part of the route [4][5]. This may exclude perfectly reliable nodes > > from being used for future payments. > > Eclair's code does not penalize nodes for future payment attempts in this > case. It only ignores them for the retries of that particular payment. > > Cheers, > Bastien > > Le mer. 19 oct. 2022 à 13:13, Joost Jager <joost.ja...@gmail.com> a > écrit : > >> Hi list, >> >> I wanted to get back to a long-standing issue in Lightning: gaps in error >> attribution. I've posted about this before back in 2019 [1]. >> >> Error attribution is important to properly penalize nodes after a payment >> failure occurs. The goal of the penalty is to give the next attempt a >> better chance at succeeding. In the happy failure flow, the sender is able >> to determine the origin of the failure and penalizes a single node or pair >> of nodes. >> >> Unfortunately it is possible for nodes on the route to hide themselves. >> If they return random data as the failure message, the sender won't know >> where the failure happened. Some senders then penalize all nodes that were >> part of the route [4][5]. This may exclude perfectly reliable nodes from >> being used for future payments. Other senders penalize no nodes at all >> [6][7], which allows the offending node to keep the disruption going. >> >> A special case of this is a final node sending back random data. Senders >> that penalize all nodes will keep looking for alternative routes. But >> because each alternative route still ends with that same final node, the >> sender will ultimately penalize all of its peers and possibly a lot of the >> rest of the network too. >> >> I can think of various reasons for exploiting this weakness. One is just >> plain grievance for whatever reason. Another one is to attract more traffic >> by getting competing routing nodes penalized. Or the goal could be to >> sufficiently mess up reputation tracking of a specific sender node to make >> it hard for that node to make further payments. >> >> Related to this are delays in the path. A node can delay propagating back >> a failure message and the sender won't be able to determine which node did >> it. >> >> The link at the top of this post [1] describes a way to address both >> unreadable failure messages as well as delays by letting each node on the >> route append a timestamp and hmac to the failure message. The great >> challenge is to do this in such a way that nodes don’t learn their position >> in the path. >> >> I'm revisiting this idea, and have prototyped various ways to implement >> it. In the remainder of this post, I will describe the variant that I >> thought works best (so far). >> >> # Failure message format >> >> The basic idea of the new format is to let each node (not just the error >> source) commit to the failure message when it passes it back by adding an >> hmac. The sender verifies all hmacs upon receipt of the failure message. >> This makes it impossible for any of the nodes to modify the failure message >> without revealing that they might have played a part in the modification. >> It won’t be possible for the sender to pinpoint an exact node, because >> either end of a communication channel may have modified the message. >> Pinpointing a pair of nodes however is good enough, and is commonly done >> for regular onion failures too. >> >> On the highest level, the new failure message consists of three parts: >> >> `message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len) >> >> * `message` is the standard onion failure message as described in [2], >> but without the hmac. The hmac is now part of `hmacs` and doesn't need to >> be repeated. >> >> * `payloads` is a fixed length array that contains space for each node >> (`hop_payload`) on the route to add data to return to the sender. Ideally >> the contents and size of `hop_payload` is signaled so that future >> extensions don’t require all nodes to upgrade. For now, we’ll assume the >> following 9-byte format: >> >> `is_final` (1 byte) | `duration` (8 bytes) >> >> `is_final` indicates whether this node is the failure source. The >> sender uses `is_final` to determine when to stop the >> decryption/verification process. >> >> `duration` is the time in milliseconds that the node held the htlc. By >> observing the series of reported durations, the sender is able to pinpoint >> a delay down to a pair of nodes. >> >> The `hop_payload` is repeated 27 times (the maximum route length). >> >> Every hop shifts `payloads` 9 bytes to the right and puts its own >> `hop_payload` in the 9 left-most bytes. >> >> * `hmacs` is a fixed length array where nodes add their hmacs as the >> failure message travels back to the sender. >> >> To keep things simple, I'll describe the format as if the maximum route >> length was only three hops (instead of 27): >> >> `hmac_0_2` | `hmac_0_1`| `hmac_0_0`| `hmac_1_1`| `hmac_1_0`| `hmac_2_0` >> >> Because nodes don't know their position in the path, it's unclear to >> them what part of the failure message they are supposed to include in the >> hmac. They can't just include everything, because if part of that data is >> deleted later (to keep the message size fixed) it opens up the possibility >> for nodes to blame others. >> >> The solution here is to provide hmacs for all possible positions. The >> last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0` >> to the block. Each hmac corresponds to a presumed position in the path, >> where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0` >> for the shortest (node is the error source). >> >> `hmac_x_y` is the hmac added by node x (counted from the node that is >> currently handling the failure message) assuming that this node is y hops >> away from the final node. >> >> Before an hop adds its hmacs, it first deletes some of the previous >> hmacs. This keeps the failure message at a fixed length. The removed hmacs >> are the ones that cannot be useful anymore. If node 0 adds itself, the >> former node 0 (now node 1) cannot be at the first position anymore. The >> former node 1 (now node 2) cannot be at the second position anymore. The >> former node 2 cannot be the source of the error anymore and isn’t >> represented in the failure message any longer. The corresponding hmacs (the >> now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node >> 0. >> >> Deleting the useless data reduces the number of hmacs (and roughly the >> total failure message size) to half. >> >> The delete operation transform the fields above to: >> >> <empty> | <empty> | <empty> | `hmac_0_1`| `hmac_0_0`| `hmac_1_0` >> >> The exact data that is included in each hmac is: >> * `message` >> * the node’s own `hop_payload` and a set of downstream `hop_payload`s, >> depending on assumed position >> * a set of downstream node hmacs, depending on assumed position >> >> For example `hmac_0_1` is based on: >> >> `message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0` >> >> If the node that is currently handling the failure message is one hop >> away from the final node, it needs to cover its own `hop_payload[0]`, the >> final node `hop_payload[1]` and the final node hmac `hmac_1_0`. >> >> A longer path is committed to in `hmac_0_2`: >> >> `message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` | >> `hmac_1_1` | `hmac_2_0` >> >> The current node is two hops away from the final node. It needs to cover >> its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]` >> for the next and final hops. Additionally it covers the next hop `hmac_1_1` >> and final hop `hmac_2_0`, which correspond to the positions of those nodes >> in the path that is assumed for `hmac_0_2`. >> >> With this information, the sender is able to verify the longest chain of >> hmacs until it encounters a `hop_payload` with `is_final` set. >> >> If any of the nodes messes with any byte in the failure message, the >> sender is always able to determine a pair of nodes that the offending node >> is part of. This statement can be verified through reasoning, but to be >> sure I also tested it with code. I’ve simulated a malicious node that >> modifies a byte of the failure message at index x and observed the error >> source as determined by the sender. For every x, the sender reports the >> same correct pair. >> >> # Size >> >> The obvious downside of the scheme above is the size. Given a maximum of >> 27 hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes >> each. This makes for a total size of 12 KB. >> >> It could be the case though that it is not possible to devise a more >> compact scheme that also preserves the existing privacy guarantees. I know >> that smart people have spent time on this problem, but nonetheless no >> better solution has come up in the past years. A proof of its non-existence >> would be interesting for sure. >> >> I personally think the size increase is justified to fix this >> vulnerability in Lightning. Also if failures are expected to become more >> rare going forward, size becomes less relevant to the overall operation of >> the network. >> >> Another option is to reduce the maximum number of hops. It is >> questionable whether 27 hops are really needed in practice, and such long >> routes also contribute to latency and capital lock up. If for example the >> new failure message could only be used with routes up to 10 hops, the total >> number of hmacs would drop from 378 to 55. This makes for a total message >> size of about 2 KB. >> >> # Signaling >> >> For backwards compatibility nodes need to know what algorithm they should >> run to generate or transform the failure message. This can be signaled by >> the sender via a tlv onion field. A failure message format signaling >> mechanism is also discussed in the context of long failure messages [3]. >> The failure message described in this post could be just another version. >> >> Additionally, intermediate nodes need to advertise their capability to >> transform the new format through a feature bit. >> >> # Delayed successes >> >> It’s not just failures that can be delayed. Successes can too. In that >> case, there is no failure message to improve. It could be an option to add >> the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message. >> >> [1] >> https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html >> [2] >> https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors >> [3] https://github.com/lightning/bolts/pull/1021 >> [4] >> https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419 >> [5] >> >> https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221 >> [6] >> https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461 >> [7] >> https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682 >> >> _______________________________________________ >> Lightning-dev mailing list >> Lightning-dev@lists.linuxfoundation.org >> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev >> >
_______________________________________________ Lightning-dev mailing list Lightning-dev@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev