Re: [Lightning-dev] Fat Errors

Joost Jager Thu, 20 Oct 2022 08:46:10 -0700

Ah, missed that. Thanks for the correction.
Joost.

On Thu, Oct 20, 2022 at 5:36 PM Bastien TEINTURIER <bast...@acinq.fr> wrote:


> Hi Joost,
>
> I need more time to review your proposed change, but I wanted to quickly
> correct a misunderstanding you had in quoting eclair's code:
>
> > Unfortunately it is possible for nodes on the route to hide themselves.
> > If they return random data as the failure message, the sender won't know
> > where the failure happened. Some senders then penalize all nodes that
> > were part of the route [4][5]. This may exclude perfectly reliable nodes
> > from being used for future payments.
>
> Eclair's code does not penalize nodes for future payment attempts in this
> case. It only ignores them for the retries of that particular payment.
>
> Cheers,
> Bastien
>
> Le mer. 19 oct. 2022 à 13:13, Joost Jager <joost.ja...@gmail.com> a
> écrit :
>
>> Hi list,
>>
>> I wanted to get back to a long-standing issue in Lightning: gaps in error
>> attribution. I've posted about this before back in 2019 [1].
>>
>> Error attribution is important to properly penalize nodes after a payment
>> failure occurs. The goal of the penalty is to give the next attempt a
>> better chance at succeeding. In the happy failure flow, the sender is able
>> to determine the origin of the failure and penalizes a single node or pair
>> of nodes.
>>
>> Unfortunately it is possible for nodes on the route to hide themselves.
>> If they return random data as the failure message, the sender won't know
>> where the failure happened. Some senders then penalize all nodes that were
>> part of the route [4][5]. This may exclude perfectly reliable nodes from
>> being used for future payments. Other senders penalize no nodes at all
>> [6][7], which allows the offending node to keep the disruption going.
>>
>> A special case of this is a final node sending back random data. Senders
>> that penalize all nodes will keep looking for alternative routes. But
>> because each alternative route still ends with that same final node, the
>> sender will ultimately penalize all of its peers and possibly a lot of the
>> rest of the network too.
>>
>> I can think of various reasons for exploiting this weakness. One is just
>> plain grievance for whatever reason. Another one is to attract more traffic
>> by getting competing routing nodes penalized. Or the goal could be to
>> sufficiently mess up reputation tracking of a specific sender node to make
>> it hard for that node to make further payments.
>>
>> Related to this are delays in the path. A node can delay propagating back
>> a failure message and the sender won't be able to determine which node did
>> it.
>>
>> The link at the top of this post [1] describes a way to address both
>> unreadable failure messages as well as delays by letting each node on the
>> route append a timestamp and hmac to the failure message. The great
>> challenge is to do this in such a way that nodes don’t learn their position
>> in the path.
>>
>> I'm revisiting this idea, and have prototyped various ways to implement
>> it. In the remainder of this post, I will describe the variant that I
>> thought works best (so far).
>>
>> # Failure message format
>>
>> The basic idea of the new format is to let each node (not just the error
>> source) commit to the failure message when it passes it back by adding an
>> hmac. The sender verifies all hmacs upon receipt of the failure message.
>> This makes it impossible for any of the nodes to modify the failure message
>> without revealing that they might have played a part in the modification.
>> It won’t be possible for the sender to pinpoint an exact node, because
>> either end of a communication channel may have modified the message.
>> Pinpointing a pair of nodes however is good enough, and is commonly done
>> for regular onion failures too.
>>
>> On the highest level, the new failure message consists of three parts:
>>
>> `message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)
>>
>> * `message` is the standard onion failure message as described in [2],
>> but without the hmac. The hmac is now part of `hmacs` and doesn't need to
>> be repeated.
>>
>> * `payloads` is a fixed length array that contains space for each node
>> (`hop_payload`) on the route to add data to return to the sender. Ideally
>> the contents and size of `hop_payload` is signaled so that future
>> extensions don’t require all nodes to upgrade. For now, we’ll assume the
>> following 9-byte format:
>>
>>   `is_final` (1 byte) | `duration` (8 bytes)
>>
>>   `is_final` indicates whether this node is the failure source. The
>> sender uses `is_final` to determine when to stop the
>> decryption/verification process.
>>
>>   `duration` is the time in milliseconds that the node held the htlc. By
>> observing the series of reported durations, the sender is able to pinpoint
>> a delay down to a pair of nodes.
>>
>>   The `hop_payload` is repeated 27 times (the maximum route length).
>>
>>   Every hop shifts `payloads` 9 bytes to the right and puts its own
>> `hop_payload` in the 9 left-most bytes.
>>
>> * `hmacs` is a fixed length array where nodes add their hmacs as the
>> failure message travels back to the sender.
>>
>>   To keep things simple, I'll describe the format as if the maximum route
>> length was only three hops (instead of 27):
>>
>>   `hmac_0_2` | `hmac_0_1`| `hmac_0_0`| `hmac_1_1`| `hmac_1_0`| `hmac_2_0`
>>
>>   Because nodes don't know their position in the path, it's unclear to
>> them what part of the failure message they are supposed to include in the
>> hmac. They can't just include everything, because if part of that data is
>> deleted later (to keep the message size fixed) it opens up the possibility
>> for nodes to blame others.
>>
>>   The solution here is to provide hmacs for all possible positions. The
>> last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0`
>> to the block. Each hmac corresponds to a presumed position in the path,
>> where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0`
>> for the shortest (node is the error source).
>>
>>   `hmac_x_y` is the hmac added by node x (counted from the node that is
>> currently handling the failure message) assuming that this node is y hops
>> away from the final node.
>>
>> Before an hop adds its hmacs, it first deletes some of the previous
>> hmacs. This keeps the failure message at a fixed length. The removed hmacs
>> are the ones that cannot be useful anymore. If node 0 adds itself, the
>> former node 0 (now node 1) cannot be at the first position anymore. The
>> former node 1 (now node 2) cannot be at the second position anymore. The
>> former node 2 cannot be the source of the error anymore and isn’t
>> represented in the failure message any longer. The corresponding hmacs (the
>> now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node
>> 0.
>>
>> Deleting the useless data reduces the number of hmacs (and roughly the
>> total failure message size) to half.
>>
>> The delete operation transform the fields above to:
>>
>> <empty> | <empty> | <empty> | `hmac_0_1`| `hmac_0_0`| `hmac_1_0`
>>
>> The exact data that is included in each hmac is:
>>   * `message`
>>   * the node’s own `hop_payload` and a set of downstream `hop_payload`s,
>> depending on assumed position
>>   * a set of downstream node hmacs, depending on assumed position
>>
>> For example `hmac_0_1` is based on:
>>
>> `message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`
>>
>> If the node that is currently handling the failure message is one hop
>> away from the final node, it needs to cover its own `hop_payload[0]`, the
>> final node `hop_payload[1]` and the final node hmac `hmac_1_0`.
>>
>> A longer path is committed to in `hmac_0_2`:
>>
>> `message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` |
>> `hmac_1_1` | `hmac_2_0`
>>
>> The current node is two hops away from the final node. It needs to cover
>> its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]`
>> for the next and final hops. Additionally it covers the next hop `hmac_1_1`
>> and final hop `hmac_2_0`, which correspond to the positions of those nodes
>> in the path that is assumed for `hmac_0_2`.
>>
>> With this information, the sender is able to verify the longest chain of
>> hmacs until it encounters a `hop_payload` with `is_final` set.
>>
>> If any of the nodes messes with any byte in the failure message, the
>> sender is always able to determine a pair of nodes that the offending node
>> is part of. This statement can be verified through reasoning, but to be
>> sure I also tested it with code. I’ve simulated a malicious node that
>> modifies a byte of the failure message at index x and observed the error
>> source as determined by the sender. For every x, the sender reports the
>> same correct pair.
>>
>> # Size
>>
>> The obvious downside of the scheme above is the size. Given a maximum of
>> 27 hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes
>> each. This makes for a total size of 12 KB.
>>
>> It could be the case though that it is not possible to devise a more
>> compact scheme that also preserves the existing privacy guarantees. I know
>> that smart people have spent time on this problem, but nonetheless no
>> better solution has come up in the past years. A proof of its non-existence
>> would be interesting for sure.
>>
>> I personally think the size increase is justified to fix this
>> vulnerability in Lightning. Also if failures are expected to become more
>> rare going forward, size becomes less relevant to the overall operation of
>> the network.
>>
>> Another option is to reduce the maximum number of hops. It is
>> questionable whether 27 hops are really needed in practice, and such long
>> routes also contribute to latency and capital lock up. If for example the
>> new failure message could only be used with routes up to 10 hops, the total
>> number of hmacs would drop from 378 to 55. This makes for a total message
>> size of about 2 KB.
>>
>> # Signaling
>>
>> For backwards compatibility nodes need to know what algorithm they should
>> run to generate or transform the failure message. This can be signaled by
>> the sender via a tlv onion field. A failure message format signaling
>> mechanism is also discussed in the context of long failure messages [3].
>> The failure message described in this post could be just another version.
>>
>> Additionally, intermediate nodes need to advertise their capability to
>> transform the new format through a feature bit.
>>
>> # Delayed successes
>>
>> It’s not just failures that can be delayed. Successes can too. In that
>> case, there is no failure message to improve. It could be an option to add
>> the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.
>>
>> [1]
>> https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
>> [2]
>> https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
>> [3] https://github.com/lightning/bolts/pull/1021
>> [4]
>> https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
>> [5]
>>
>> https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
>> [6]
>> https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
>> [7]
>> https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682
>>
>> _______________________________________________
>> Lightning-dev mailing list
>> Lightning-dev@lists.linuxfoundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
>>
>
On Thu, Oct 20, 2022 at 5:36 PM Bastien TEINTURIER <bast...@acinq.fr> wrote:

> Hi Joost,
>
> I need more time to review your proposed change, but I wanted to quickly
> correct a misunderstanding you had in quoting eclair's code:
>
> > Unfortunately it is possible for nodes on the route to hide themselves.
> > If they return random data as the failure message, the sender won't know
> > where the failure happened. Some senders then penalize all nodes that
> > were part of the route [4][5]. This may exclude perfectly reliable nodes
> > from being used for future payments.
>
> Eclair's code does not penalize nodes for future payment attempts in this
> case. It only ignores them for the retries of that particular payment.
>
> Cheers,
> Bastien
>
> Le mer. 19 oct. 2022 à 13:13, Joost Jager <joost.ja...@gmail.com> a
> écrit :
>
>> Hi list,
>>
>> I wanted to get back to a long-standing issue in Lightning: gaps in error
>> attribution. I've posted about this before back in 2019 [1].
>>
>> Error attribution is important to properly penalize nodes after a payment
>> failure occurs. The goal of the penalty is to give the next attempt a
>> better chance at succeeding. In the happy failure flow, the sender is able
>> to determine the origin of the failure and penalizes a single node or pair
>> of nodes.
>>
>> Unfortunately it is possible for nodes on the route to hide themselves.
>> If they return random data as the failure message, the sender won't know
>> where the failure happened. Some senders then penalize all nodes that were
>> part of the route [4][5]. This may exclude perfectly reliable nodes from
>> being used for future payments. Other senders penalize no nodes at all
>> [6][7], which allows the offending node to keep the disruption going.
>>
>> A special case of this is a final node sending back random data. Senders
>> that penalize all nodes will keep looking for alternative routes. But
>> because each alternative route still ends with that same final node, the
>> sender will ultimately penalize all of its peers and possibly a lot of the
>> rest of the network too.
>>
>> I can think of various reasons for exploiting this weakness. One is just
>> plain grievance for whatever reason. Another one is to attract more traffic
>> by getting competing routing nodes penalized. Or the goal could be to
>> sufficiently mess up reputation tracking of a specific sender node to make
>> it hard for that node to make further payments.
>>
>> Related to this are delays in the path. A node can delay propagating back
>> a failure message and the sender won't be able to determine which node did
>> it.
>>
>> The link at the top of this post [1] describes a way to address both
>> unreadable failure messages as well as delays by letting each node on the
>> route append a timestamp and hmac to the failure message. The great
>> challenge is to do this in such a way that nodes don’t learn their position
>> in the path.
>>
>> I'm revisiting this idea, and have prototyped various ways to implement
>> it. In the remainder of this post, I will describe the variant that I
>> thought works best (so far).
>>
>> # Failure message format
>>
>> The basic idea of the new format is to let each node (not just the error
>> source) commit to the failure message when it passes it back by adding an
>> hmac. The sender verifies all hmacs upon receipt of the failure message.
>> This makes it impossible for any of the nodes to modify the failure message
>> without revealing that they might have played a part in the modification.
>> It won’t be possible for the sender to pinpoint an exact node, because
>> either end of a communication channel may have modified the message.
>> Pinpointing a pair of nodes however is good enough, and is commonly done
>> for regular onion failures too.
>>
>> On the highest level, the new failure message consists of three parts:
>>
>> `message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)
>>
>> * `message` is the standard onion failure message as described in [2],
>> but without the hmac. The hmac is now part of `hmacs` and doesn't need to
>> be repeated.
>>
>> * `payloads` is a fixed length array that contains space for each node
>> (`hop_payload`) on the route to add data to return to the sender. Ideally
>> the contents and size of `hop_payload` is signaled so that future
>> extensions don’t require all nodes to upgrade. For now, we’ll assume the
>> following 9-byte format:
>>
>>   `is_final` (1 byte) | `duration` (8 bytes)
>>
>>   `is_final` indicates whether this node is the failure source. The
>> sender uses `is_final` to determine when to stop the
>> decryption/verification process.
>>
>>   `duration` is the time in milliseconds that the node held the htlc. By
>> observing the series of reported durations, the sender is able to pinpoint
>> a delay down to a pair of nodes.
>>
>>   The `hop_payload` is repeated 27 times (the maximum route length).
>>
>>   Every hop shifts `payloads` 9 bytes to the right and puts its own
>> `hop_payload` in the 9 left-most bytes.
>>
>> * `hmacs` is a fixed length array where nodes add their hmacs as the
>> failure message travels back to the sender.
>>
>>   To keep things simple, I'll describe the format as if the maximum route
>> length was only three hops (instead of 27):
>>
>>   `hmac_0_2` | `hmac_0_1`| `hmac_0_0`| `hmac_1_1`| `hmac_1_0`| `hmac_2_0`
>>
>>   Because nodes don't know their position in the path, it's unclear to
>> them what part of the failure message they are supposed to include in the
>> hmac. They can't just include everything, because if part of that data is
>> deleted later (to keep the message size fixed) it opens up the possibility
>> for nodes to blame others.
>>
>>   The solution here is to provide hmacs for all possible positions. The
>> last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0`
>> to the block. Each hmac corresponds to a presumed position in the path,
>> where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0`
>> for the shortest (node is the error source).
>>
>>   `hmac_x_y` is the hmac added by node x (counted from the node that is
>> currently handling the failure message) assuming that this node is y hops
>> away from the final node.
>>
>> Before an hop adds its hmacs, it first deletes some of the previous
>> hmacs. This keeps the failure message at a fixed length. The removed hmacs
>> are the ones that cannot be useful anymore. If node 0 adds itself, the
>> former node 0 (now node 1) cannot be at the first position anymore. The
>> former node 1 (now node 2) cannot be at the second position anymore. The
>> former node 2 cannot be the source of the error anymore and isn’t
>> represented in the failure message any longer. The corresponding hmacs (the
>> now non-existent `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node
>> 0.
>>
>> Deleting the useless data reduces the number of hmacs (and roughly the
>> total failure message size) to half.
>>
>> The delete operation transform the fields above to:
>>
>> <empty> | <empty> | <empty> | `hmac_0_1`| `hmac_0_0`| `hmac_1_0`
>>
>> The exact data that is included in each hmac is:
>>   * `message`
>>   * the node’s own `hop_payload` and a set of downstream `hop_payload`s,
>> depending on assumed position
>>   * a set of downstream node hmacs, depending on assumed position
>>
>> For example `hmac_0_1` is based on:
>>
>> `message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`
>>
>> If the node that is currently handling the failure message is one hop
>> away from the final node, it needs to cover its own `hop_payload[0]`, the
>> final node `hop_payload[1]` and the final node hmac `hmac_1_0`.
>>
>> A longer path is committed to in `hmac_0_2`:
>>
>> `message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` |
>> `hmac_1_1` | `hmac_2_0`
>>
>> The current node is two hops away from the final node. It needs to cover
>> its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]`
>> for the next and final hops. Additionally it covers the next hop `hmac_1_1`
>> and final hop `hmac_2_0`, which correspond to the positions of those nodes
>> in the path that is assumed for `hmac_0_2`.
>>
>> With this information, the sender is able to verify the longest chain of
>> hmacs until it encounters a `hop_payload` with `is_final` set.
>>
>> If any of the nodes messes with any byte in the failure message, the
>> sender is always able to determine a pair of nodes that the offending node
>> is part of. This statement can be verified through reasoning, but to be
>> sure I also tested it with code. I’ve simulated a malicious node that
>> modifies a byte of the failure message at index x and observed the error
>> source as determined by the sender. For every x, the sender reports the
>> same correct pair.
>>
>> # Size
>>
>> The obvious downside of the scheme above is the size. Given a maximum of
>> 27 hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes
>> each. This makes for a total size of 12 KB.
>>
>> It could be the case though that it is not possible to devise a more
>> compact scheme that also preserves the existing privacy guarantees. I know
>> that smart people have spent time on this problem, but nonetheless no
>> better solution has come up in the past years. A proof of its non-existence
>> would be interesting for sure.
>>
>> I personally think the size increase is justified to fix this
>> vulnerability in Lightning. Also if failures are expected to become more
>> rare going forward, size becomes less relevant to the overall operation of
>> the network.
>>
>> Another option is to reduce the maximum number of hops. It is
>> questionable whether 27 hops are really needed in practice, and such long
>> routes also contribute to latency and capital lock up. If for example the
>> new failure message could only be used with routes up to 10 hops, the total
>> number of hmacs would drop from 378 to 55. This makes for a total message
>> size of about 2 KB.
>>
>> # Signaling
>>
>> For backwards compatibility nodes need to know what algorithm they should
>> run to generate or transform the failure message. This can be signaled by
>> the sender via a tlv onion field. A failure message format signaling
>> mechanism is also discussed in the context of long failure messages [3].
>> The failure message described in this post could be just another version.
>>
>> Additionally, intermediate nodes need to advertise their capability to
>> transform the new format through a feature bit.
>>
>> # Delayed successes
>>
>> It’s not just failures that can be delayed. Successes can too. In that
>> case, there is no failure message to improve. It could be an option to add
>> the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.
>>
>> [1]
>> https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
>> [2]
>> https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
>> [3] https://github.com/lightning/bolts/pull/1021
>> [4]
>> https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
>> [5]
>>
>> https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
>> [6]
>> https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
>> [7]
>> https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682
>>
>> _______________________________________________
>> Lightning-dev mailing list
>> Lightning-dev@lists.linuxfoundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
>>
>

_______________________________________________
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev

Re: [Lightning-dev] Fat Errors

Reply via email to