Hi Vincenzo,

On 6 Aug 2023, at 21:35, Vincenzo Palazzo wrote:
I do not see the privacy concern here for two main reasons:
The dry run will be done on a selected node (one whose operator wants to run it), so I am assuming that with my research node I am not buying my house (or maybe just a coffee).

I'm confused: wasn't the idea to collect data from real-world forwarding nodes rather than creating yet another synthetic/research data set?

The only privacy leak that I see is in the following fields, but they are meaningless from the analysis point of view, so we can fake them (I will do this in Core Lightning). You just need to make sure that the fake channel ID is always the same, right?

* channel_in (uint64)[P]: the short channel ID of the incoming channel
that forwarded the HTLC.

* channel_out (uint64)[P]: the short channel ID of the outgoing
channel that forwarded the HTLC.

* peer_in (hex string)[P]: the hex encoded pubkey of the remote peer
for the channel_in.

* peer_out (hex string)[P]: the hex encoded pubkey of the remote peer
for the channel_out.
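
A minimal sketch of the "fake ID that is always the same" idea, assuming a keyed hash (HMAC-SHA256) is an acceptable way to produce the fake values. The names `SECRET_KEY`, `pseudonymize`, and `pseudonymize_record` are hypothetical and not part of Core Lightning or any existing tool; the record is assumed to be a plain dict using the field names listed above.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-node secret. It must stay private and must not change
# during the collection period, otherwise the fake IDs stop being stable.
SECRET_KEY = secrets.token_bytes(32)

# The four fields marked [P] in the list above.
PRIVATE_FIELDS = ("channel_in", "channel_out", "peer_in", "peer_out")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a real identifier to a stable fake one using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep the output short


def pseudonymize_record(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of a forwarding record with the private fields replaced."""
    out = dict(record)
    for field in PRIVATE_FIELDS:
        if field in out:
            out[field] = pseudonymize(str(out[field]), key)
    return out
```

The same input always maps to the same fake value under one key, so per-channel statistics still work. This hides only the identifiers themselves; it does nothing against correlation with timing or topology.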

I care about this point to make this research result 100% reproducible
by giving access to raw data to involve more people in verifying and
proving that we are wrong.

Because the data has to come from real nodes with real bitcoin
involved, we can end up in a situation where we live in our bubble of
certainty.

Am I missing something about the point that we can fake the channel_id and
the node pubkey?

Sure, you can obfuscate these fields, but that doesn't mean it's impossible to re-identify node IDs and channels by correlating the dataset with publicly available data, such as the graph topology and gossip data.

Just to throw out some ideas: you could for example assume that over a sufficiently long collection period each channel of a node will eventually be used and show up in the dataset, i.e., you get a good approximation of the number of channels the observation point has with its neighbors. This alone might already be enough to give a good guess as to which obfuscated node ID corresponds to which node in the network. If we now use the timestamps, we can further exclude from the candidate set any nodes/channels that couldn't have been used at the time the HTLC was sent, and, especially if we have access to datasets from neighboring nodes, we may be able to easily derive which anonymized clusters correspond to which real-world clusters. You could find neighboring nodes by checking that all `ts_added_ns` timestamps between two candidates are sufficiently close together (i.e., that no additional hop would "fit in there" assuming a reasonable real-world RTT).

Once we have re-identified which obfuscated nodes are which real-world nodes, we could derive the HTLC amounts from the gathered fees and draw conclusions about channel liquidity. We may then even use the HTLC resolution time delta to draw some further conclusions about the network distance of the HTLC destination. Of course, all of these are estimations, so the adversary has some error probability in there, and fuzzing the timestamps might already go a long way toward making the adversary's life harder.
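
To make two of these steps concrete, here is a rough sketch, not a worked-out attack. It assumes the adversary knows the forwarding node's fee policy from gossip (`fee_base_msat` and `fee_proportional_millionths` in `channel_update`). The function names, the 500 ms RTT bound, and the jitter parameter are all made-up assumptions for illustration.

```python
import random


def estimate_amount_msat(fee_msat: int, base_msat: int, ppm: int) -> float:
    """Invert fee = base + amount * ppm / 1_000_000 to estimate the amount.

    Real fees are computed with integer division, so this is only an
    approximation of the forwarded amount.
    """
    if ppm <= 0:
        raise ValueError("cannot invert the fee formula without a proportional fee")
    return (fee_msat - base_msat) * 1_000_000 / ppm


def could_be_adjacent(ts_a_ns: int, ts_b_ns: int,
                      max_rtt_ns: int = 500_000_000) -> bool:
    """Check whether two timestamps are close enough that no extra hop 'fits'.

    max_rtt_ns is an assumed bound, not a measured value.
    """
    return abs(ts_a_ns - ts_b_ns) <= max_rtt_ns


def fuzz_timestamp_ns(ts_ns: int, jitter_ns: int, rng: random.Random) -> int:
    """Possible mitigation: add uniform noise in [-jitter_ns, +jitter_ns]."""
    return ts_ns + rng.randint(-jitter_ns, jitter_ns)
```

If the jitter is larger than the assumed RTT bound, the adjacency check above gets much less reliable, at the cost of less precise timing data for honest analysis.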

P.S.: At the university I came from, I saw real examples of PhD programs that failed to start due to the lack of real data.

Yes, as I said before, I'm very familiar with trying to do Lightning research in the absence of real-world data sets :)

To be clear, I'm not objecting to this effort, just saying a) that sharing aggregated results is probably a good starting point and b) that the framework and the associated risks of the data collection should be clearly communicated beforehand to node operators considering sharing their data.

Best,

Elias
_______________________________________________
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
