On 05/04/14 04:55, Robert Hailey wrote:
> On 2014/04/04 (Apr), at 5:29 PM, Matthew Toseland wrote:
>
>> On 04/04/14 23:10, Robert Hailey wrote:
>>> On 2014/04/04 (Apr), at 2:16 PM, Matthew Toseland wrote:
>>>
>>> In my opinion, much of this almost-reached-the-data stuff (bloom filters, 
>>> data probes, FOAF, etc) serves to hide some deep bug or design flaw; that 
>>> is, to make a broken system usable.
>>>
>> There are good reasons for the "brokenness". Datastores are not all the
>> same size; the sink node for the key you are inserting might have a
>> rather small store. Or it might be offline, although we try to deal with
>> that to some degree.
> At this point, can we really presume that the insert makes it to the node 
> with the nearest address? Most of what I have heard to this point is that the 
> data will end up stuck in the path towards the address.
I don't see any reason to think that routing is severely broken.
Incoming requests tend to be very specialised (which is a good thing),
and backoff is relatively low at any given time. On the other hand it's
possible that load management breaks things badly.
> It would be great if true, because that would narrow the field of possible 
> problems to "the target node". Such as if it changes location, goes offline, 
> etc.
Right. Typically we *should* store on 3 nodes, but we don't know; there
are some proposals for special keys/requests to find out without risking
real content. Also, it won't change location unless it's darknet, but it
might well go offline.
>> But it might still be cached nearby.
> That "might" is what I'm referring to by the brokenness, or being kludgy. 
> Although... it would be interesting to know if the average cache rollover 
> time directly corresponds to the 2 week window.
It *will* be cached along the path. But the turnover is high.
>>> Passing INSERTs only to veteran nodes & having an "outer ring" of 
>>> connectedness (only applicable to opennet) might fix the lower-level issues.
>> Then what do you do with all the other nodes?
> It would be better to say that the insert should stick to a veteran node; the 
> only way I know how to do that is in routing, but maybe it can be judged as 
> the insert backs out?
>
> WRT "the other nodes", there are a *LOT* more get requests that inserts, so 
> even if it did make the traffic a bit lopsided I don't see that it would 
> necessarily be a bad thing.
So you are saying the majority of nodes should never receive inserts,
and thus should only cache data which passes through them as requests?
IMHO that would be a very bad thing. Especially as it conflicts with my
plans to have high HTL requests and inserts go to "core nodes". But more
generally, it's vital that inserts and requests go down the same route -
otherwise what's the point of all those extra useless hops? Relaying the
data through more nodes because they *might* have it cached doesn't make
much sense.
>> Are you suggesting that we
>> go with the original proposal where every node which isn't high
>> bandwidth, high uptime, is transient?
> So long as transient != leaf-node... except I don't think bandwidth is 
> particularly important in terms of finding data.
Transient = low uptime, no routing, does not serve requests, only uses
bandwidth for its own traffic.
Normal node = routes requests and inserts, stores data, doesn't see
"high htl" requests/inserts.
Core node = sees "high htl" requests/inserts as well. Uptime and
bandwidth requirements.
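
To make the distinction concrete, here is a rough sketch in Java; the tier
names and the HTL cutoff are mine for illustration, not anything in fred:

    // Illustrative only: NodeTier and HIGH_HTL_THRESHOLD are assumptions,
    // not existing fred code.
    enum NodeTier { TRANSIENT, NORMAL, CORE }

    final class TierPolicy {
        static final int MAX_HTL = 18;
        static final int HIGH_HTL_THRESHOLD = 16; // assumed cutoff for "high htl"

        /** Would we route a request/insert at this HTL to a peer of this tier? */
        static boolean routesTo(NodeTier tier, int htl) {
            switch (tier) {
                case TRANSIENT: return false;                    // never serves others
                case NORMAL:    return htl < HIGH_HTL_THRESHOLD; // no high-HTL traffic
                case CORE:      return true;                     // sees high HTL too
                default:        return false;
            }
        }
    }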
>> IMHO requesting from unreliable nodes is perfectly acceptable
> GET-ing from unreliable nodes is fine... we can just move on to the next 
> node. INSERT-ing into a transient/faulty/new node might cause big problems!
It wastes hops and therefore bandwidth.

If the problem is "what if we store on a low-uptime node", the current
solution is to take that into account when deciding whether we are a
sink node: if the next hop is closer to the target than we are, but has
low uptime, we ignore that peer and store the data ourselves anyway.
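
Roughly like this (a sketch of the decision just described; Peer, location()
and uptimeFraction() are stand-ins for illustration, not the real fred API):

    interface Peer {
        double location();
        double uptimeFraction();
    }

    final class SinkDecision {
        /** Store here unless a *reliable* peer is strictly closer to the target. */
        static boolean shouldStoreHere(double myLoc, double target,
                                       Iterable<Peer> peers, double minUptime) {
            double myDist = distance(myLoc, target);
            for (Peer p : peers) {
                if (distance(p.location(), target) < myDist
                        && p.uptimeFraction() >= minUptime) {
                    return false; // a reliable closer peer will be the sink
                }
            }
            return true; // any closer peers are low uptime: store anyway
        }

        static double distance(double a, double b) {
            double d = Math.abs(a - b);
            return Math.min(d, 1.0 - d); // circular keyspace [0,1)
        }
    }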
>> ...provided we are below the HTL threshold (so that MAST attackers still 
>> have to put in the resources to become
>> "core" nodes).
> I think that it is reasonable to say that servicing GET requests could earn 
> you enough reputation to handle an INSERT (or other veteran status perks), as 
> with a GET there is a bit of proof that the node did the work you requested.
>
> However, it does *not* sound so reasonable to presume that handling INSERTs 
> would gain you a reputation, because there is no proof the peer did 
> anything... surely it would be somewhat easy to instantly respond to every 
> INSERT with a good/happy/done status?
Right. Verifying inserts is hard.
>>> It's a bit curious, but intriguing, that you mention aggregating the data 
>>> probes... seems kinda like hinting to your neighbors: "I'm working on X, Y, 
>>> & Z... any leads?"... esp. if we need some packet padding anyway... then, 
>>> well... in the common end case *most* of our neighbors will have already 
>>> seen the request (right?)... so I'm not sure how much this buys us.
>> Why would they already have seen the request? The request goes to ~ 25
>> nodes. Each of them has up to 100 neighbours. Even if the topology is
>> poor they can't all have been visited already.
> All that I mean is that:
> (1) until we get near the target address, we have no particular reason to 
> "expect" that nodes adjacent to the path have the data (so if they have it, 
> it is only "luck"), and
If it's very popular we will find it in a few hops. If it's middle-range
popular, there's a good chance it'll be cached on an adjacent node. Even
if it's very unpopular, there is luck; maybe somebody nearby was
involved in a request some time ago and has a big store. And we're only
talking about 32 bytes for a key here. So how do we quantify whether
it's worth it? Simulations and common sense say bloom filter sharing
should improve performance significantly - this is a similar mechanism
with a different kind of cost.
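
Back-of-envelope, with made-up numbers just to show the shape of the cost
(none of these figures are measurements):

    // All numbers below are assumptions for illustration only.
    public class ProbeCost {
        public static void main(String[] args) {
            int keyBytes = 32;        // one routing key
            double keysPerSecond = 5; // assumed local request rate
            int peers = 40;           // assumed peer count
            double bytesPerSec = keyBytes * keysPerSecond * peers;
            System.out.printf("~%.1f KiB/s extra if every key is offered to every peer%n",
                    bytesPerSec / 1024.0);
            // Bloom filter sharing instead pays a one-off cost per peer (the
            // filter) plus small periodic updates, amortised over many requests.
        }
    }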
> (2) once we are near the target address, the algorithm is predominantly to 
> search/find the data (so... it should work?).
It should work if the node didn't fall off the network, didn't have a
very small store, etc. Sure.
> Said another way, I have no doubt that this makes it more effective 
> (statistically.. more nodes, etc), but it does not add to the "emergent 
> qualities" of the software/network.
If it boosts performance and doesn't cost us any security then it's
worthwhile AFAICS?
>>> Hmm... what if.... whenever an item drops from the cache, if it is a small 
>>> chk (i.e. one data packet, not a superblock) we turn it into a FAST_INSERT 
>>> (a one-shot insert, no thread/state required)... you just drain your cache 
>>> back into the network/store?
>>>
>> The capacity of the network is finite. Every time we drop a key we
>> insert the lost key -> the nodes that store it insert their lost keys ->
>> load balloons real fast.
> You may have misunderstood... I am specifically not talking about data being 
> dropped from the store, only the cache.
>
> What we want is the data to find its way back to the correct *store*, but I 
> presume that most of the current network's effectiveness is in the *cache*
Why? The average store size would seem to be roughly consistent with the
survival time statistics.
>  (and the fact that we search so many nodes).
There are 10,000 nodes. A typical search goes through 25 of them. So
something must be working.
> I'm talking about a *very* lightweight request, and one of the lowest 
> priority... droppable, even. If the receiver has the datum in the cache (or 
> store), the request can die... or, the receiver could just stick the datum in 
> its cache (no more work, no response)... or finally, it can just shoot the 
> request towards the target address (maybe qualified as an insert-capable 
> peer).
>
> WRT ballooning load, the only metric greatly affected would be bandwidth. 
> So, if you like, we can even tag our cache with "who we got the data from", 
> to make sure we don't just send it back the same route... further emphasizing 
> the "healing" nature of this stateless/threadless request.... akin to a UDP 
> packet.
So one request causes 20 nodes to add to their caches. Each one of them
fires off a mini-insert, which can be relayed to many nodes. That still
sounds like ballooning load to me!
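
Quick arithmetic with assumed numbers (both figures are guesses, purely to
show how it multiplies):

    // Illustrative only: cachedPerRequest and relayHops are assumptions.
    public class FanOut {
        public static void main(String[] args) {
            int cachedPerRequest = 20; // nodes caching the data on one request path
            int relayHops = 5;         // assumed hops each "drain" mini-insert travels
            System.out.println("Extra messages when that key eventually rolls out of "
                    + "those caches: ~" + (cachedPerRequest * relayHops));
        }
    }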
> You could even qualify it such that the request must always make forward 
> progress (closer peer than current node's location). In fact, I kinda like 
> that idea... having a message that doesn't need an HTL technically reveals 
> that much less information.
Keeping the best visited location so far gives away a lot of
information. We tried something like this a long time ago.
>>>> Do we need even more "put it back where it should be" mechanisms? E.g.
>>>> if a sink node finds the data from a store probe it should store it?
>>>> Would this help with censorship attacks?
>>> Wouldn't that require a *huge* lookup table (or bloom filter?!) a la 
>>> RecentlyFailed.
>> To quench reinserts for popular keys when they are already running
>> because everyone is requesting the same key and 1/100th of them are
>> reinserting it? That could take some memory, or we could even put it on
>> disk, yes. OTOH it's not really needed judging by the insert:request
>> ratio at the moment?
> I would much rather a solution that is stateless and contemporaneous... e.g. 
> on GET or cache invalidation.... not having to guess or remember what's 
> needed.
I think we are talking about different things here? IIRC the above was
"reinsert data when we find it in the wrong place" ... So e.g. a request
which is fulfilled quickly (from a cache) would still propagate until it
finds a store containing the data; if no store has it, we push it into
the store at the sink. But for popular keys this will be very, very
wasteful. Then I got thinking that what we do now (randomly reinsert 1
in 100 successful requests) has similar issues. The duplication doesn't
seem to be a problem at the moment, but it might be with some of these
mechanisms...
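
For reference, the "1 in 100" heuristic amounts to something like this (a
sketch, not the actual fred code; only the constant is taken from the above):

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch of the current behaviour as described above: on a successful
    // request, reinsert the block with small probability.
    final class RandomReinsert {
        static final double REINSERT_PROBABILITY = 0.01; // "1 in 100"

        /** Decide, per successful request, whether to fire off a reinsert. */
        static boolean shouldReinsert() {
            return ThreadLocalRandom.current().nextDouble() < REINSERT_PROBABILITY;
        }
    }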
>>>> The average store turnover is unknown but guesswork suggests it's around
>>>> 2 weeks ...
>>> That sounds quite disagreeable. I'd much rather the original design goal of 
>>> a nearly-infinite vat of nearly-permanent storage. :-)
>> Right. But the question is, is the poor data retention simply due to the
>> ratio of new stuff being put in to the size of people's datastores? If
>> so, there's not much we can do, short of bigger storage requirements or
>> even slower inserts.
> My uneducated reaction would be "surely not"... as I presume that there are 
> (and always will be) more consumers than producers. It's true that even an 
> idle node does make some inserts (ARK)... but that's about it.
I don't know. Can we quantify this? We can look at average store size
and store write rates on a typical node ... Which is what I did when
giving the above ballpark estimates ...
> I wonder if that's a possible attack vector (inserting noise).
Inserting it and then requesting it would be more effective. But your
request points would eventually converge on your insert point.
>>> Do you mean "time until the data is not reachable", or a node's stores 
>>> *actually* getting so much data that they are rolling over?
>> The sinks for a given key shouldn't change much. We see ~ 2 weeks data
>> retention for the data being findable. The question is, is this because
>> stores are rolling over quickly (because they are small and have a
>> relatively large number of inserts), or is it because of routing/uptime
>> issues e.g. the sink nodes for the key are all offline, can't be reached
>> due to load problems etc?
>>
>> This is testable: We just need to probe the average store rollover time!
> I concur, and would add cache too. Not sure how you would measure that (per 
> key removedTime-insertedTime?).
Calculate it from the write rate and the store size. Then quantise it
and add some noise, as with all such stats.
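
Something like this (a sketch; the 2-day buckets and Gaussian noise are
choices I'm making up here for illustration, not anything decided):

    import java.util.Random;

    final class TurnoverEstimate {
        /** Estimate store turnover from size and write rate, then coarsen it
         *  before reporting, as with the other probe stats. */
        static double reportedTurnoverDays(long storeBytes, long writeBytesPerDay,
                                           Random rng) {
            double days = (double) storeBytes / writeBytesPerDay;
            double quantised = Math.round(days / 2.0) * 2.0; // 2-day buckets (assumed)
            double noise = rng.nextGaussian();               // ~ +/- 1 day of noise
            return Math.max(0.0, quantised + noise);
        }
    }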
> Supposing for a moment that the data stores are errantly filling up with 
> far-away-addressed data, or that the swap/address-randomization was the true 
> problem... what would be the best way to detect that?
The storage-behaviour-probe keys I mentioned above maybe? For it to be a
plausible hypothesis maybe you should find out whether it's true of your
own node first? Or how you might measure it?

PS Locations don't change on opennet. Or they shouldn't ... I need to
check that.
