On 2014/04/04 (Apr), at 5:29 PM, Matthew Toseland wrote:

> On 04/04/14 23:10, Robert Hailey wrote:
>> On 2014/04/04 (Apr), at 2:16 PM, Matthew Toseland wrote:
>> 
>> In my opinion, much of this almost-reached-the-data stuff (bloom filters, 
>> data probes, FOAF, etc) serves to hide some deep bug or design flaw; that 
>> is, to make a broken system usable.

> There are good reasons for the "brokenness". Datastores are not all the
> same size; the sink node for the key you are inserting might have a
> rather small store. Or it might be offline, although we try to deal with
> that to some degree.

At this point, can we really presume that the insert makes it to the node with 
the nearest address? Most of what I have heard to this point is that the data 
will end up stuck along the path towards the address.

It would be great if true, because that would narrow the field of possible 
problems to "the target node". Such as if it changes location, goes offline, 
etc.

> But it might still be cached nearby.

That "might" is what I'm referring to by the brokenness, or being kludgy. 
Although... it would be interesting to know if the average cache rollover time 
directly corresponds to the 2 week window.

>> Passing INSERTs only to veteran nodes & having a "outer ring" of 
>> connectedness (only applicable to opennet) might fix the lower level issues.

> Then what do you do with all the other nodes?

It would be better to say that the insert should *stick* to a veteran node. The 
only way I know to do that is in routing, but maybe it could be judged as the 
insert backs out?

WRT "the other nodes", there are a *LOT* more GET requests than inserts, so 
even if this made the traffic a bit lopsided I don't see that it would 
necessarily be a bad thing.

> Are you suggesting that we
> go with the original proposal where every node which isn't high
> bandwidth, high uptime, is transient?

So long as transient != leaf-node... except I don't think bandwidth is 
particularly important in terms of finding data.

> IMHO requesting from unreliable nodes is perfectly acceptable

GET-ing from unreliable nodes is fine; we can just move on to the next node. 
INSERT-ing into a transient/faulty/new node might cause big problems!

> ...provided we are below the HTL threshold (so that MAST attackers still have 
> to put in the resources to become
> "core" nodes).

I think that it is reasonable to say that servicing GET requests could earn you 
enough reputation to handle an INSERT (or other veteran status perks), as with 
a GET there is a bit of proof that the node did the work you requested.

However, it does *not* sound so reasonable to presume that handling INSERTs 
would gain you a reputation, because there is no proof the peer did anything... 
surely it would be fairly easy to instantly respond to every INSERT with a 
good/happy/done status?
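To sketch the asymmetry I mean (all names and the threshold are made up): a peer 
earns credit only for GETs it verifiably served, because a returned block can be 
checked against the requested key, whereas an INSERT ack proves nothing.

```java
import java.util.HashMap;
import java.util.Map;

public class PeerReputation {
    private final Map<String, Integer> verifiedGets = new HashMap<>();
    private static final int INSERT_THRESHOLD = 100; // hypothetical tuning knob

    /** Credit a peer only when the returned data actually matched the key. */
    public void recordGet(String peerId, boolean dataMatchedKey) {
        if (dataMatchedKey) {
            verifiedGets.merge(peerId, 1, Integer::sum);
        }
    }

    /** An INSERT ack earns nothing: a malicious peer could ack instantly. */
    public void recordInsertAck(String peerId) {
        // intentionally no credit
    }

    /** "Veteran status perk": eligible to receive INSERTs. */
    public boolean isInsertCapable(String peerId) {
        return verifiedGets.getOrDefault(peerId, 0) >= INSERT_THRESHOLD;
    }
}
```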

>> It's a bit curious, but intriguing, that you mention aggregating the data 
>> probes... seems kinda like hinting to your neighbors: "I'm working on X, Y, 
>> & Z... any leads?"... esp. If we need some packet padding anyway... then, 
>> well... in the common end case *most* of our neighbors will have already 
>> seen the request (right?)... so I'm not sure how much this buys us.

> Why would they already have seen the request? The request goes to ~ 25
> nodes. Each of them has up to 100 neighbours. Even if the topology is
> poor they can't all have been visited already.

All that I mean is that:
(1) until we get near the target address, we have no particular reason to 
"expect" that nodes adjacent to the path have the data (so if they have it, it 
is only "luck"), and
(2) once we are near the target address, the algorithm is predominantly to 
search/find the data (so... it should work?).

Said another way, I have no doubt that this makes it more effective 
(statistically.. more nodes, etc), but it does not add to the "emergent 
qualities" of the software/network.

>> Hmm... what if.... whenever an item drops from the cache, if it is a small 
>> chk (i.e. one data packet, not a superblock) we turn it into a FAST_INSERT 
>> (a one-shot insert, no thread/state required)... you just drain your cache 
>> back into the network/store?

> The capacity of the network is finite. Every time we drop a key we
> insert the lost key -> the nodes that store it insert their lost keys ->
> load balloons real fast.

You may have misunderstood... I am specifically not talking about data being 
dropped from the store, only the cache.

What we want is for the data to find its way back to the correct *store*, but I 
presume that most of the current network's effectiveness lies in the *cache* (and 
the fact that we search so many nodes).

I'm talking about a *very* lightweight request, one of the lowest 
priority... droppable, even. If the receiver has the datum in its cache (or 
store), the request can die; or the receiver could just stick the datum in 
its cache (no more work, no response); or finally, it can just shoot the 
request towards the target address (maybe qualified to an insert-capable peer).
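Receiver-side, the whole thing could be one stateless decision, something like 
this (all names hypothetical, just to make the idea concrete):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the proposed FAST_INSERT receiver logic: no thread, no reply,
 *  no per-request state; the message is droppable. */
public class FastInsertHandler {
    public enum Action { DROP, CACHE, FORWARD }

    private final Map<String, byte[]> cache = new HashMap<>();
    private final Map<String, byte[]> store = new HashMap<>();

    /**
     * If we already hold the datum the request dies; otherwise we either
     * keep it in our cache or pass it one hop towards the target address.
     */
    public Action handle(String key, byte[] datum, boolean haveCloserPeer) {
        if (cache.containsKey(key) || store.containsKey(key)) {
            return Action.DROP;          // request can die here
        }
        if (!haveCloserPeer) {
            cache.put(key, datum);       // just keep it: no more work, no response
            return Action.CACHE;
        }
        return Action.FORWARD;           // shoot it towards the target address
    }
}
```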

WRT ballooning load, the only metric greatly affected would be bandwidth. 
So, if you like, we could even tag our cache with "who we got the data from" to 
make sure we don't just send it back the same route... further emphasizing the 
"healing" nature of this stateless/threadless request... akin to a UDP packet.

You could even qualify it such that the request must always make forward 
progress (only to a peer closer to the target than the current node). In fact, 
I kinda like that idea... having a message that doesn't need an HTL technically 
reveals that much less information.

>>> Do we need even more "put it back where it should be" mechanisms? E.g.
>>> if a sink node finds the data from a store probe it should store it?
>>> Would this help with censorship attacks?
>> Wouldn't that require a *huge* lookup table (or bloom filter?!) ala 
>> RecentlyFailed.
> To quench reinserts for popular keys when they are already running
> because everyone is requesting the same key and 1/100th of them are
> reinserting it? That could take some memory, or we could even put it on
> disk, yes. OTOH it's not really needed judging by the insert:request
> ratio at the moment?

I would much rather a solution that is stateless and contemporaneous... e.g. on 
GET or cache invalidation.... not having to guess or remember what's needed.

>>> The average store turnover is unknown but guesswork suggests it's around
>>> 2 weeks ...
>> That sounds quite disagreeable. I'd much rather the original design goal of 
>> a nearly-infinite vat of nearly-permanent storage. :-)
> Right. But the question is, is the poor data retention simply due to the
> ratio of new stuff being put in to the size of people's datastores? If
> so, there's not much we can do, short of bigger storage requirements or
> even slower inserts.

My uneducated reaction would be "surely not"... as I presume that there are (and 
always will be) more consumers than producers. It's true that even an idle node 
does make some inserts (ARK)... but that's about it.

I wonder if that's a possible attack vector (inserting noise).

>> Do you mean "time until the data is not reachable", or a node's stores 
>> *actually* getting so much data that they are rolling over?
> The sinks for a given key shouldn't change much. We see ~ 2 weeks data
> retention for the data being findable. The question is, is this because
> stores are rolling over quickly (because they are small and have a
> relatively large number of inserts), or is it because of routing/uptime
> issues e.g. the sink nodes for the key are all offline, can't be reached
> due to load problems etc?
> 
> This is testable: We just need to probe the average store rollover time!

I concur, and would add the cache too. Not sure how you would measure that 
(per-key removedTime - insertedTime?).
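The per-key measurement could be as simple as this (a sketch, names invented): 
remember when each key entered the store/cache, and on eviction accumulate 
removedTime - insertedTime; the running mean estimates turnover.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a per-key rollover meter for a store or cache. */
public class RolloverMeter {
    private final Map<String, Long> insertedAt = new HashMap<>();
    private long totalLifetimeMs = 0;
    private long evictions = 0;

    public void onInsert(String key, long nowMs) {
        insertedAt.put(key, nowMs);
    }

    public void onEvict(String key, long nowMs) {
        Long t = insertedAt.remove(key);
        if (t != null) {
            totalLifetimeMs += nowMs - t;   // removedTime - insertedTime
            evictions++;
        }
    }

    /** Average key lifetime in ms, or -1 if nothing has been evicted yet. */
    public long averageLifetimeMs() {
        return evictions == 0 ? -1 : totalLifetimeMs / evictions;
    }
}
```

A probe could then report averageLifetimeMs() instead of raw store contents.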

-

Supposing for a moment that the data stores are errantly filling up with 
far-away-addressed data, or that the swap/address-randomization was the true 
problem... what would be the best way to detect that?

--
Robert Hailey
_______________________________________________
Devl mailing list
Devl@freenetproject.org
https://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl