Hi Vincent,
On 6 Aug 2023, at 21:35, Vincenzo Palazzo wrote:
I do not see the privacy concern here for two main reasons:
The dry run will be done on a selected node (whose operator wants to run it), so
I am assuming that with my research node, I am not buying my house (or
maybe a coffee).
I'm confused: wasn't the idea to collect data from real-world forwarding
nodes rather than creating yet another synthetic/research data set?
The only leak of privacy that I see is for the following fields, but
they are meaningless from the analysis point of view, so we can
fake them (I will do this in Core Lightning). You just need to make sure
that the fake channel id is always the same, right?
* channel_in (uint64)[P]: the short channel ID of the incoming
channel that forwarded the HTLC.
* channel_out (uint64)[P]: the short channel ID of the outgoing
channel that forwarded the HTLC.
* peer_in (hex string)[P]: the hex encoded pubkey of the remote peer
for the channel_in.
* peer_out (hex string)[P]: the hex encoded pubkey of the remote peer
for the channel_out.
I care about this point to make this research result 100% reproducible
by giving access to raw data to involve more people in verifying and
proving that we are wrong.
Due to the limitation that the data should come from real nodes
with real bitcoin involved, we can fall into the situation where we live
in our bubble of certainty.
Am I missing something in the point that we can fake the channel_id
and the node pub key?
Sure you can obfuscate these fields, but that doesn't mean it's not
possible to re-identify node ids and channels by correlating the dataset
with publicly available data, such as the graph topology and gossip
data.
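To make the obfuscation step concrete, here is a minimal sketch in Python of consistent pseudonymization with a keyed hash, so that the same real channel or peer always maps to the same fake id. The function names, the `secret` parameter, and the 16-character truncation are assumptions for illustration only, not taken from Core Lightning or any existing tool:

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes, length: int = 16) -> str:
    """Map a real identifier to a stable pseudonym via HMAC-SHA256.

    The same (value, secret) pair always yields the same pseudonym, so
    relations between events are preserved while the real id is hidden.
    """
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def obfuscate_forward(event: dict, secret: bytes) -> dict:
    """Return a copy of a forwarding event with the [P] fields replaced.

    Only the four privacy-sensitive fields are touched; all other fields
    (timestamps, fees, ...) are copied unchanged.
    """
    out = dict(event)
    for field in ("channel_in", "channel_out", "peer_in", "peer_out"):
        out[field] = pseudonymize(str(event[field]), secret)
    return out
```

Note that this only hides the identifiers themselves: the structure of the dataset (how many distinct channels appear, and when) is left intact, which is exactly what the correlation ideas rely on.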
Just to throw out some ideas: you could for example assume that over a
sufficiently long collection period each channel of a node will
eventually be used and show up in the dataset, i.e., you get a good
approximation of the number of channels the observation point has with
its neighbors. This alone might already be enough to give a good guess
which obfuscated node id corresponds to which node in the network. If we
now use the timestamps we can further exclude any nodes/channels that
couldn't have been used at the time the HTLC was sent from the candidate
set, and, especially if we have access to datasets from neighboring
nodes, we may be able to easily derive which anonymized clusters
correspond to which real world clusters. You could find neighboring
nodes by checking that all `ts_added_ns` timestamps between two
candidates are sufficiently close together (i.e., that no additional hop
would "fit in there" assuming a reasonable real-world RTT). Once we have
re-identified which obfuscated nodes are which real-world nodes, we
could derive HTLC amounts from the gathered fees and draw
conclusions about channel liquidity. We might then even use the HTLC
resolution time delta to draw some further conclusions on the
network-distance of the HTLC destination. Of course, all of these are
estimations, so the adversary has some error probability in there, and
fuzzing the timestamps might already go a long way toward making the
adversary's life harder.
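The first two heuristics above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `public_degrees` is a hypothetical mapping from real node ids to channel counts taken from gossip, the timestamp lists are assumed to contain one entry per HTLC on a single candidate channel pair, and all function names and thresholds are made up for this example:

```python
def observed_channel_count(events):
    """Count the distinct (obfuscated) channels that appear in a dataset."""
    channels = set()
    for event in events:
        channels.add(event["channel_in"])
        channels.add(event["channel_out"])
    return len(channels)


def degree_candidates(observed, public_degrees, tolerance=0):
    """Return real node ids whose public channel count is close to `observed`.

    `public_degrees` is assumed to map node id -> number of channels,
    derived from the public graph.
    """
    return sorted(
        node for node, degree in public_degrees.items()
        if abs(degree - observed) <= tolerance
    )


def plausibly_adjacent(ts_out, ts_in, max_hop_delay_ns):
    """Check whether two timestamp series could belong to neighboring nodes.

    Pairs the sorted `ts_added_ns` values of one dataset with those of the
    other and requires every delay to be non-negative and small enough
    that no additional hop would fit in between.
    """
    if not ts_out or len(ts_out) != len(ts_in):
        return False
    pairs = zip(sorted(ts_out), sorted(ts_in))
    return all(0 <= t_in - t_out <= max_hop_delay_ns for t_out, t_in in pairs)
```

For example, `degree_candidates(observed_channel_count(events), public_degrees, tolerance=1)` would narrow down which real nodes an anonymized dataset could belong to, and `plausibly_adjacent` could then be applied to pairs of datasets to guess which of them are neighbors. Random jitter added to the timestamps before publication would directly weaken the second check.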
P.S.: At the university I came from, I saw some real examples of
PhD programs that failed to start due to the lack of real data.
Yes, as said before I'm very familiar with trying to do Lightning
research in absence of real-world data sets :)
To be clear, I'm not objecting to this effort, just saying a) that sharing
aggregated results is probably a good starting point and b) that the
framework and the associated risks of the data collection should be
clearly communicated beforehand to node operators considering sharing
their data.
Best,
Elias
_______________________________________________
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev