Pushed a golang implementation of the fat errors here: https://github.com/lightningnetwork/lightning-onion/pull/60
Joost.

On Wed, Oct 19, 2022 at 1:12 PM Joost Jager <joost.ja...@gmail.com> wrote:

Hi list,

I wanted to get back to a long-standing issue in Lightning: gaps in error attribution. I've posted about this before back in 2019 [1].

Error attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.

Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened. Some senders then penalize all nodes that were part of the route [4][5]. This may exclude perfectly reliable nodes from being used for future payments. Other senders penalize no nodes at all [6][7], which allows the offending node to keep the disruption going.

A special case of this is a final node sending back random data. Senders that penalize all nodes will keep looking for alternative routes. But because each alternative route still ends with that same final node, the sender will ultimately penalize all of its peers and possibly a lot of the rest of the network too.

I can think of various reasons for exploiting this weakness. One is just plain grievance for whatever reason. Another one is to attract more traffic by getting competing routing nodes penalized. Or the goal could be to sufficiently mess up the reputation tracking of a specific sender node to make it hard for that node to make further payments.

Related to this are delays in the path. A node can delay propagating back a failure message, and the sender won't be able to determine which node did it.

The link at the top of this post [1] describes a way to address both unreadable failure messages and delays by letting each node on the route append a timestamp and hmac to the failure message. The great challenge is to do this in such a way that nodes don't learn their position in the path.

I'm revisiting this idea, and have prototyped various ways to implement it. In the remainder of this post, I will describe the variant that I think works best so far.

# Failure message format

The basic idea of the new format is to let each node (not just the error source) commit to the failure message when it passes it back by adding an hmac. The sender verifies all hmacs upon receipt of the failure message. This makes it impossible for any of the nodes to modify the failure message without revealing that they might have played a part in the modification. It won't be possible for the sender to pinpoint an exact node, because either end of a communication channel may have modified the message. Pinpointing a pair of nodes, however, is good enough, and is commonly done for regular onion failures too.

On the highest level, the new failure message consists of three parts:

`message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)

* `message` is the standard onion failure message as described in [2], but without the hmac. The hmac is now part of `hmacs` and doesn't need to be repeated.

* `payloads` is a fixed length array that contains space for each node (`hop_payload`) on the route to add data to return to the sender. Ideally the contents and size of `hop_payload` are signaled so that future extensions don't require all nodes to upgrade. For now, we'll assume the following 9-byte format:

  `is_final` (1 byte) | `duration` (8 bytes)

  `is_final` indicates whether this node is the failure source. The sender uses `is_final` to determine when to stop the decryption/verification process.

  `duration` is the time in milliseconds that the node held the htlc. By observing the series of reported durations, the sender is able to pinpoint a delay down to a pair of nodes.

  The `hop_payload` is repeated 27 times (the maximum route length). Every hop shifts `payloads` 9 bytes to the right and puts its own `hop_payload` in the 9 left-most bytes.

* `hmacs` is a fixed length array where nodes add their hmacs as the failure message travels back to the sender.
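To make the layout concrete, here is a minimal Go sketch of these three parts, assuming only the sizes stated above. The type and constant names are illustrative and not taken from the linked lightning-onion PR:

```go
package faterrors

// Illustrative constants for the proposed format. Only the 9-byte
// hop_payload, the 32-byte hmac and the 27-hop maximum come from the
// text above; the names are mine.
const (
	maxHops       = 27
	hopPayloadLen = 9 // is_final (1 byte) | duration (8 bytes)
	hmacLen       = 32

	// The hmacs block holds one hmac per (node, assumed position)
	// pair that can still be valid: 27 + 26 + ... + 1 = 378 slots.
	numHmacs = maxHops * (maxHops + 1) / 2
)

// failureMessage mirrors the three-part layout:
// message (var len) | payloads (fixed len) | hmacs (fixed len).
type failureMessage struct {
	// The standard onion failure message of [2], without its
	// trailing hmac (that hmac now lives in the hmacs block).
	message []byte

	// One hop_payload slot per possible hop. Each hop shifts the
	// block 9 bytes to the right and writes its own payload into
	// the 9 left-most bytes.
	payloads [maxHops * hopPayloadLen]byte

	// Positional hmacs, pruned and prepended at every hop.
	hmacs [numHmacs * hmacLen]byte
}
```

Note that the 378 hmac slots of 32 bytes each already account for the roughly 12 KB discussed in the size section below.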
To keep things simple, I'll describe the format as if the maximum route length was only three hops (instead of 27):

`hmac_0_2` | `hmac_0_1` | `hmac_0_0` | `hmac_1_1` | `hmac_1_0` | `hmac_2_0`

Because nodes don't know their position in the path, it's unclear to them what part of the failure message they are supposed to include in the hmac. They can't just include everything, because if part of that data is deleted later (to keep the message size fixed), it opens up the possibility for nodes to blame others.

The solution here is to provide hmacs for all possible positions. The last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0` to the block. Each hmac corresponds to a presumed position in the path, where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0` for the shortest (the node is the error source).

`hmac_x_y` is the hmac added by node x (counted from the node that is currently handling the failure message), assuming that this node is y hops away from the final node.

Before a hop adds its hmacs, it first deletes some of the previous hmacs. This keeps the failure message at a fixed length. The removed hmacs are the ones that cannot be useful anymore. If node 0 adds itself, the former node 0 (now node 1) cannot be at the first position anymore. The former node 1 (now node 2) cannot be at the second position anymore. The former node 2 cannot be the source of the error anymore and isn't represented in the failure message any longer. The corresponding hmacs (the now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node 0.

Deleting the useless data reduces the number of hmacs (and roughly the total failure message size) by half.

The delete operation transforms the fields above to:

<empty> | <empty> | <empty> | `hmac_0_1` | `hmac_0_0` | `hmac_1_0`
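The delete-and-shift step can be sketched in Go as follows, using the toy three-hop maximum. The grouped-by-node view of the hmacs block and the function name are mine, chosen for clarity:

```go
package faterrors

// toyMaxHops matches the simplified three-hop example above.
const toyMaxHops = 3

// shiftHmacs sketches the per-hop update of the hmacs block. The block
// is viewed as one group per node: groups[x] holds the hmacs added by
// node x, ordered from the longest assumed path to the shortest.
// ownHmacs are the processing node's new hmacs, in the same order.
func shiftHmacs(groups [][][32]byte, ownHmacs [][32]byte) [][][32]byte {
	shifted := make([][][32]byte, 0, toyMaxHops)

	// The processing node becomes the new node 0.
	shifted = append(shifted, ownHmacs)

	// Former node x becomes node x+1 and can no longer sit at its
	// longest assumed position, so its first hmac is deleted
	// (hmac_0_2 and hmac_1_1 in the example). The former last node
	// drops out of the message entirely, deleting hmac_2_0.
	for x := 0; x < toyMaxHops-1; x++ {
		shifted = append(shifted, groups[x][1:])
	}

	return shifted
}
```

After this call, the former node 0's remaining hmacs occupy the slots labeled `hmac_1_1` and `hmac_1_0` above, which is exactly the relabeling that keeps the block at a fixed size.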
The exact data that is included in each hmac is:

* `message`
* the node's own `hop_payload` and a set of downstream `hop_payload`s, depending on the assumed position
* a set of downstream node hmacs, depending on the assumed position

For example, `hmac_0_1` is based on:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`

If the node that is currently handling the failure message is one hop away from the final node, it needs to cover its own `hop_payload[0]`, the final node's `hop_payload[1]` and the final node's hmac `hmac_1_0`.

A longer path is committed to in `hmac_0_2`:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` | `hmac_1_1` | `hmac_2_0`

Here the current node is two hops away from the final node. It needs to cover its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]` for the next and final hops. Additionally it covers the next hop's `hmac_1_1` and the final hop's `hmac_2_0`, which correspond to the positions of those nodes in the path that is assumed for `hmac_0_2`.

With this information, the sender is able to verify the longest chain of hmacs until it encounters a `hop_payload` with `is_final` set.

If any of the nodes messes with any byte in the failure message, the sender is always able to determine a pair of nodes that the offending node is part of. This statement can be verified through reasoning, but to be sure I also tested it with code. I've simulated a malicious node that modifies a byte of the failure message at index x and observed the error source as determined by the sender. For every x, the sender reports the same correct pair.
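As a sketch of what each positional hmac covers, the function below computes `hmac_0_y` for one assumed position. The per-hop key and the use of HMAC-SHA256 are my assumptions (HMAC-SHA256 is what BOLT 4 already uses for failure messages); the field ordering follows the two examples above:

```go
package faterrors

import (
	"crypto/hmac"
	"crypto/sha256"
)

// computeHmac0 sketches what hmac_0_y commits to. key stands in for a
// per-hop key derived from the onion shared secret; its derivation is
// out of scope here.
//
// payloads[0] is the node's own hop_payload and payloads[1..y] are the
// downstream hop_payloads for the assumed position. downstreamHmacs
// holds hmac_x_(y-x) for x = 1..y, e.g. hmac_1_1 and hmac_2_0 when
// computing hmac_0_2.
func computeHmac0(key, message []byte,
	payloads [][hopPayloadLen]byte,
	downstreamHmacs [][32]byte) [32]byte {

	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	for _, p := range payloads {
		mac.Write(p[:])
	}
	for _, h := range downstreamHmacs {
		mac.Write(h[:])
	}

	var out [32]byte
	copy(out[:], mac.Sum(nil))
	return out
}
```

The sender then runs the mirror image of this computation, recomputing and checking each node's hmac for the position that node actually occupies, until it reaches a `hop_payload` with `is_final` set.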
# Size

The obvious downside of the scheme above is the size. Given a maximum of 27 hops, the `hmacs` block contains 27+26+25+...+1 = 378 hmacs of 32 bytes each. This makes for a total size of about 12 KB.

It could be that no more compact scheme exists that also preserves the existing privacy guarantees. I know that smart people have spent time on this problem, but nonetheless no better solution has come up in the past years. A proof of its non-existence would be interesting for sure.

I personally think the size increase is justified to fix this vulnerability in Lightning. Also, if failures are expected to become more rare going forward, size becomes less relevant to the overall operation of the network.

Another option is to reduce the maximum number of hops. It is questionable whether 27 hops are really needed in practice, and such long routes also contribute to latency and capital lock-up. If for example the new failure message could only be used with routes of up to 10 hops, the total number of hmacs would drop from 378 to 55. This makes for a total message size of about 2 KB.

# Signaling

For backwards compatibility, nodes need to know which algorithm to run to generate or transform the failure message. This can be signaled by the sender via a tlv onion field. A failure message format signaling mechanism is also discussed in the context of long failure messages [3]. The failure message described in this post could be just another version.

Additionally, intermediate nodes need to advertise their capability to transform the new format through a feature bit.

# Delayed successes

It's not just failures that can be delayed. Successes can too. In that case, there is no failure message to improve. It could be an option to add the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.

[1] https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
[2] https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
[3] https://github.com/lightning/bolts/pull/1021
[4] https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
[5] https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
[6] https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
[7] https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682