Pushed a golang implementation of the fat errors here: https://github.com/lightningnetwork/lightning-onion/pull/60
Joost.

On Wed, Oct 19, 2022 at 1:12 PM Joost Jager <joost.ja...@gmail.com> wrote:

Hi list,

I wanted to get back to a long-standing issue in Lightning: gaps in error attribution. I've posted about this before back in 2019 [1].

Error attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.

Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened. Some senders then penalize all nodes that were part of the route [4][5]. This may exclude perfectly reliable nodes from being used for future payments. Other senders penalize no nodes at all [6][7], which allows the offending node to keep the disruption going.

A special case of this is a final node sending back random data. Senders that penalize all nodes will keep looking for alternative routes. But because each alternative route still ends with that same final node, the sender will ultimately penalize all of its peers and possibly a lot of the rest of the network too.

I can think of various reasons for exploiting this weakness. One is just plain grievance for whatever reason. Another one is to attract more traffic by getting competing routing nodes penalized. Or the goal could be to sufficiently mess up the reputation tracking of a specific sender node to make it hard for that node to make further payments.

Related to this are delays in the path. A node can delay propagating back a failure message, and the sender won't be able to determine which node did it.

The link at the top of this post [1] describes a way to address both unreadable failure messages and delays by letting each node on the route append a timestamp and hmac to the failure message. The great challenge is to do this in such a way that nodes don't learn their position in the path.

I'm revisiting this idea, and have prototyped various ways to implement it. In the remainder of this post, I will describe the variant that I think works best so far.

# Failure message format

The basic idea of the new format is to let each node (not just the error source) commit to the failure message when it passes it back by adding an hmac. The sender verifies all hmacs upon receipt of the failure message. This makes it impossible for any of the nodes to modify the failure message without revealing that they might have played a part in the modification. It won't be possible for the sender to pinpoint an exact node, because either end of a communication channel may have modified the message. Pinpointing a pair of nodes, however, is good enough, and is commonly done for regular onion failures too.

On the highest level, the new failure message consists of three parts:

`message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)

* `message` is the standard onion failure message as described in [2], but without the hmac. The hmac is now part of `hmacs` and doesn't need to be repeated.

* `payloads` is a fixed length array that contains space for each node (`hop_payload`) on the route to add data to return to the sender. Ideally the contents and size of `hop_payload` are signaled so that future extensions don't require all nodes to upgrade. For now, we'll assume the following 9-byte format:

  `is_final` (1 byte) | `duration` (8 bytes)

  `is_final` indicates whether this node is the failure source. The sender uses `is_final` to determine when to stop the decryption/verification process.

  `duration` is the time in milliseconds that the node held the htlc. By observing the series of reported durations, the sender is able to pinpoint a delay down to a pair of nodes.

  The `hop_payload` is repeated 27 times (the maximum route length). Every hop shifts `payloads` 9 bytes to the right and puts its own `hop_payload` in the 9 left-most bytes.

* `hmacs` is a fixed length array where nodes add their hmacs as the failure message travels back to the sender.
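To make the layout concrete, here is a minimal Go sketch of these three parts, assuming only the sizes stated above. The type and constant names are illustrative and not taken from the linked lightning-onion PR:

```go
package faterrors

// Illustrative constants for the proposed format. Only the 9-byte
// hop_payload, the 32-byte hmac and the 27-hop maximum come from the
// text above; the names are mine.
const (
	maxHops       = 27
	hopPayloadLen = 9 // is_final (1 byte) | duration (8 bytes)
	hmacLen       = 32

	// The hmacs block holds one hmac per (node, assumed position)
	// pair that can still be valid: 27 + 26 + ... + 1 = 378 slots.
	numHmacs = maxHops * (maxHops + 1) / 2
)

// failureMessage mirrors the three-part layout:
// message (var len) | payloads (fixed len) | hmacs (fixed len).
type failureMessage struct {
	// The standard onion failure message of [2], without its
	// trailing hmac (that hmac now lives in the hmacs block).
	message []byte

	// One hop_payload slot per possible hop. Each hop shifts the
	// block 9 bytes to the right and writes its own payload into
	// the 9 left-most bytes.
	payloads [maxHops * hopPayloadLen]byte

	// Positional hmacs, pruned and prepended at every hop.
	hmacs [numHmacs * hmacLen]byte
}
```

Note that the 378 hmac slots of 32 bytes each already account for the roughly 12 KB discussed in the size section below.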
To keep things simple, I'll describe the format as if the maximum route length was only three hops (instead of 27):

`hmac_0_2` | `hmac_0_1` | `hmac_0_0` | `hmac_1_1` | `hmac_1_0` | `hmac_2_0`

Because nodes don't know their position in the path, it's unclear to them what part of the failure message they are supposed to include in the hmac. They can't just include everything, because if part of that data is deleted later (to keep the message size fixed), it opens up the possibility for nodes to blame others.

The solution here is to provide hmacs for all possible positions. The last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0` to the block. Each hmac corresponds to a presumed position in the path, where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0` for the shortest (the node is the error source).

`hmac_x_y` is the hmac added by node x (counted from the node that is currently handling the failure message), assuming that this node is y hops away from the final node.

Before a hop adds its hmacs, it first deletes some of the previous hmacs. This keeps the failure message at a fixed length. The removed hmacs are the ones that cannot be useful anymore. If node 0 adds itself, the former node 0 (now node 1) cannot be at the first position anymore. The former node 1 (now node 2) cannot be at the second position anymore. The former node 2 cannot be the source of the error anymore and isn't represented in the failure message any longer. The corresponding hmacs (the now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node 0.

Deleting the useless data reduces the number of hmacs (and roughly the total failure message size) by half.

The delete operation transforms the fields above to:

<empty> | <empty> | <empty> | `hmac_0_1` | `hmac_0_0` | `hmac_1_0`
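The delete-and-shift step can be sketched in Go as follows, using the toy three-hop maximum. The grouped-by-node view of the hmacs block and the function name are mine, chosen for clarity:

```go
package faterrors

// toyMaxHops matches the simplified three-hop example above.
const toyMaxHops = 3

// shiftHmacs sketches the per-hop update of the hmacs block. The block
// is viewed as one group per node: groups[x] holds the hmacs added by
// node x, ordered from the longest assumed path to the shortest.
// ownHmacs are the processing node's new hmacs, in the same order.
func shiftHmacs(groups [][][32]byte, ownHmacs [][32]byte) [][][32]byte {
	shifted := make([][][32]byte, 0, toyMaxHops)

	// The processing node becomes the new node 0.
	shifted = append(shifted, ownHmacs)

	// Former node x becomes node x+1 and can no longer sit at its
	// longest assumed position, so its first hmac is deleted
	// (hmac_0_2 and hmac_1_1 in the example). The former last node
	// drops out of the message entirely, deleting hmac_2_0.
	for x := 0; x < toyMaxHops-1; x++ {
		shifted = append(shifted, groups[x][1:])
	}

	return shifted
}
```

After this call, the former node 0's remaining hmacs occupy the slots labeled `hmac_1_1` and `hmac_1_0` above, which is exactly the relabeling that keeps the block at a fixed size.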
The exact data that is included in each hmac is:

* `message`
* the node's own `hop_payload` and a set of downstream `hop_payload`s, depending on the assumed position
* a set of downstream node hmacs, depending on the assumed position

For example, `hmac_0_1` is based on:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`

If the node that is currently handling the failure message is one hop away from the final node, it needs to cover its own `hop_payload[0]`, the final node's `hop_payload[1]` and the final node's hmac `hmac_1_0`.

A longer path is committed to in `hmac_0_2`:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` | `hmac_1_1` | `hmac_2_0`

Here the current node is two hops away from the final node. It needs to cover its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]` for the next and final hops. Additionally it covers the next hop's `hmac_1_1` and the final hop's `hmac_2_0`, which correspond to the positions of those nodes in the path that is assumed for `hmac_0_2`.

With this information, the sender is able to verify the longest chain of hmacs until it encounters a `hop_payload` with `is_final` set.

If any of the nodes messes with any byte in the failure message, the sender is always able to determine a pair of nodes that the offending node is part of. This statement can be verified through reasoning, but to be sure I also tested it with code. I've simulated a malicious node that modifies a byte of the failure message at index x and observed the error source as determined by the sender. For every x, the sender reports the same correct pair.
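As a sketch of what each positional hmac covers, the function below computes `hmac_0_y` for one assumed position. The per-hop key and the use of HMAC-SHA256 are my assumptions (HMAC-SHA256 is what BOLT 4 already uses for failure messages); the field ordering follows the two examples above:

```go
package faterrors

import (
	"crypto/hmac"
	"crypto/sha256"
)

// computeHmac0 sketches what hmac_0_y commits to. key stands in for a
// per-hop key derived from the onion shared secret; its derivation is
// out of scope here.
//
// payloads[0] is the node's own hop_payload and payloads[1..y] are the
// downstream hop_payloads for the assumed position. downstreamHmacs
// holds hmac_x_(y-x) for x = 1..y, e.g. hmac_1_1 and hmac_2_0 when
// computing hmac_0_2.
func computeHmac0(key, message []byte,
	payloads [][hopPayloadLen]byte,
	downstreamHmacs [][32]byte) [32]byte {

	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	for _, p := range payloads {
		mac.Write(p[:])
	}
	for _, h := range downstreamHmacs {
		mac.Write(h[:])
	}

	var out [32]byte
	copy(out[:], mac.Sum(nil))
	return out
}
```

The sender then runs the mirror image of this computation, recomputing and checking each node's hmac for the position that node actually occupies, until it reaches a `hop_payload` with `is_final` set.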
# Size

The obvious downside of the scheme above is the size. Given a maximum of 27 hops, the `hmacs` block contains 27+26+25+...+1 = 378 hmacs of 32 bytes each. This makes for a total size of about 12 KB.

It could be that no more compact scheme exists that also preserves the existing privacy guarantees. I know that smart people have spent time on this problem, but nonetheless no better solution has come up in the past years. A proof of its non-existence would be interesting for sure.

I personally think the size increase is justified to fix this vulnerability in Lightning. Also, if failures are expected to become more rare going forward, size becomes less relevant to the overall operation of the network.

Another option is to reduce the maximum number of hops. It is questionable whether 27 hops are really needed in practice, and such long routes also contribute to latency and capital lock-up. If for example the new failure message could only be used with routes of up to 10 hops, the total number of hmacs would drop from 378 to 55. This makes for a total message size of about 2 KB.

# Signaling

For backwards compatibility, nodes need to know which algorithm to run to generate or transform the failure message. This can be signaled by the sender via a tlv onion field. A failure message format signaling mechanism is also discussed in the context of long failure messages [3]. The failure message described in this post could be just another version.

Additionally, intermediate nodes need to advertise their capability to transform the new format through a feature bit.

# Delayed successes

It's not just failures that can be delayed. Successes can too. In that case, there is no failure message to improve. It could be an option to add the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.

[1] https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
[2] https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
[3] https://github.com/lightning/bolts/pull/1021
[4] https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
[5] https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
[6] https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
[7] https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682