Hi Elias,

I do not see a privacy concern here, for two main reasons.

First, the dry run will be done on a selected node, by an operator who wants to run it, so I am assuming that with my research node I am not buying my house (or maybe just a coffee).

Second, the only privacy leak that I see is in the following fields, but they are meaningless from the analysis point of view, so we can fake them (I will do this in Core Lightning). You just need to make sure that the fake channel ID is always the same, right?
> * channel_in (uint64)[P]: the short channel ID of the incoming channel
>   that forwarded the HTLC.
> * channel_out (uint64)[P]: the short channel ID of the outgoing
>   channel that forwarded the HTLC.
> * peer_in (hex string)[P]: the hex encoded pubkey of the remote peer
>   for the channel_in.
> * peer_out (hex string)[P]: the hex encoded pubkey of the remote peer
>   for the channel_out.

I care about this point because I want this research result to be 100% reproducible: giving access to the raw data lets more people get involved in verifying it and in proving that we are wrong. Because the data has to come from real nodes with real bitcoin involved, we can end up living in our own bubble of certainty.

Am I missing something on the point that we can fake the channel_id and the node pubkey?

P.S.: At the university I come from, I saw real examples of PhD programs that failed to start due to the lack of real data.

Cheers,

Vincent.

On Aug 3 2023, at 10:54 am, Elias Rohrer <l...@tnull.de> wrote:
>
> Hi Carla + Clara,
>
> I want to prefix this by saying that I'm very familiar with how limiting the
> lack of available real-world datasets can be for conducting significant
> simulations and empirical experiments on Lightning.
>
> However, it may be noteworthy that long-term collection of the proposed
> fields could potentially allow to re-identify the anonymized channel
> counterparties based off some heuristics correlating with the public graph
> data, especially when datasets from multiple (possibly neighbouring)
> collection points will end up being combined. Subsequently, this might allow
> to draw further conclusions on transferred amounts, channel liquidities at
> particular times, and, as HTLC settlement/failure timestamps are recorded in
> nanosecond resolution, potentially even the payment destination's identity
> (cf. [1] https://arxiv.org/pdf/2006.12143.pdf).
>
> As surrendering this kind of data therefore requires a good level of trust in
> the researchers, it might be helpful (and best practise) if you could clarify
> upfront whether you intend to time-box the collection period, where the data
> would be stored, and who would have access to it. From my point of view
> clearly defining the collection period would also be mandatory as we don't
> want to incentivise node operators to collect and store HTLC data
> longer-term, especially if it's to this degree of detail.
>
> Best,
> Elias
>
> > ### 1. Collect Anonymized Data
> >
> > We're aware that we are dealing with sensitive and private information.
> > For this reason, we propose defining a common data format so that
> > analysis tooling can be built around it, so that node operators can run
> > the analysis locally if desired. Fields marked with [P] *MUST* be
> > randomized if exported to researching teams.
> >
> > The proposed format is a CSV file with the following fields:
> > * version (uint8): set to 1, included to future-proof ourselves
> >   against the need to change this format.
> > * channel_in (uint64)[P]: the short channel ID of the incoming channel
> >   that forwarded the HTLC.
> > * channel_out (uint64)[P]: the short channel ID of the outgoing
> >   channel that forwarded the HTLC.
> > * peer_in (hex string)[P]: the hex encoded pubkey of the remote peer
> >   for the channel_in.
> > * peer_out (hex string)[P]: the hex encoded pubkey of the remote peer
> >   for the channel_out.
> > * fee_msat (uint64): the fee offered by the HTLC, expressed in msat.
> > * outgoing_liquidity (float64): the portion of
> >   `max_htlc_value_in_flight` that is occupied on channel_out after the
> >   HTLC has been forwarded.
> > * outgoing_slots (float64): the portion of `max_accepted_htlcs` that
> >   is occupied on channel_out after the HTLC has been forwarded.
> > * ts_added_ns (uint64): the unix timestamp that the HTLC was added,
> >   expressed in nanoseconds.
> > * ts_removed_ns (uint64): the unix timestamp that the HTLC was
> >   removed, expressed in nanoseconds.
> > * htlc_settled (bool): set to 0 if the HTLC failed, and 1 if it was
> >   settled.
> > * incoming_endorsed (int16): an integer indicating the endorsement
> >   status of the incoming HTLC (-1 if not present, otherwise set to the
> >   value in the incoming endorsement TLV).
> > * outgoing_endorsed (int16): an integer indicating the endorsement
> >   status of the outgoing HTLC (-1 if not set, otherwise set to the
> >   value set in the outgoing endorsement TLV).
> >
> > Before we add endorsement signaling and setting via an experimental
> > TLV, the last two values here will always be -1. The data is still
> > incredibly useful in the meantime, and allows for easy update once the
> > TLV is propagated through the network.
> >
> _______________________________________________
> Lightning-dev mailing list
> Lightning-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
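
To make the "fake channel ID is always the same" requirement concrete, here is a minimal sketch of one way the [P] fields could be replaced with stable pseudonyms before export. Everything in it is an assumption for illustration, not part of the proposal or of any node implementation: the function names, the use of a keyed HMAC-SHA256 with a secret that never leaves the node, the column order (taken from the field list above), and the absence of a CSV header row. It only shows consistency of the mapping; it does nothing about the re-identification via graph correlation or timestamps that Elias raises.

```python
import csv
import hashlib
import hmac
import io

# Column order assumed from the field list in the proposal (version 1).
FIELDS = [
    "version", "channel_in", "channel_out", "peer_in", "peer_out",
    "fee_msat", "outgoing_liquidity", "outgoing_slots",
    "ts_added_ns", "ts_removed_ns", "htlc_settled",
    "incoming_endorsed", "outgoing_endorsed",
]


def fake_channel_id(secret: bytes, scid: int) -> int:
    """Map a short channel ID to a stable pseudonym that fits in a uint64."""
    msg = b"scid:" + str(scid).encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def fake_pubkey(secret: bytes, pubkey_hex: str) -> str:
    """Map a hex pubkey to a stable 33-byte hex pseudonym.

    The result only looks like a compressed key; it is not a valid curve point.
    """
    msg = b"peer:" + bytes.fromhex(pubkey_hex)
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return "02" + digest.hex()


def pseudonymize_row(secret: bytes, row: dict) -> dict:
    """Return a copy of the row with only the [P] fields replaced."""
    out = dict(row)
    for col in ("channel_in", "channel_out"):
        out[col] = str(fake_channel_id(secret, int(row[col])))
    for col in ("peer_in", "peer_out"):
        out[col] = fake_pubkey(secret, row[col])
    return out


def pseudonymize_csv(secret: bytes, text: str) -> str:
    """Pseudonymize every row of a header-less CSV export (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    for row in reader:
        writer.writerow(pseudonymize_row(secret, row))
    return buf.getvalue()
```

With the same secret, the same real channel or peer always maps to the same fake value, so per-channel analysis still works on the exported file. A different secret per node means pseudonyms from two collection points cannot be joined directly, although the remaining fields may still allow correlation.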