On Thursday 02 April 2009 01:46:08 Jai Sharma wrote:
> Dear Developers,
>
> Please excuse me for my ignorance in the routing matters in my earlier
> proposal. It's a busy time at the college and I didn't have time to go
> through all the Freenet papers, but that's totally my problem.
>
> But after helpful comments from Sdiz and Matthew I did go through whatever
> I could get my eyes on, and here is my current understanding of how CHKs
> work. I will need your help correcting me if I am wrong anywhere.
>
> As it stands:
>
> 1) CHKs are already split into 32KB chunks if the file size is greater
> than 1KB.
>
> 2) For files greater than 32KB, the CHK itself refers to the manifest
> file, which has a structure similar to a torrent file, in the sense that
> it is a list of hashes for the pieces of the file.

Yes.
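For a shared mental model, the splitfile/manifest idea in (1)-(2) boils down
to something like the sketch below. This is purely illustrative: the class
and method names are made up, real CHKs are not simply SHA-256 of the
plaintext chunk, and real splitfiles also carry FEC check blocks and other
metadata.

// Illustrative sketch only -- not Freenet's actual splitfile code.
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

final class ManifestSketch {

    static final int CHUNK_SIZE = 32 * 1024; // 32KB data blocks

    /** Split a file's bytes into 32KB chunks and hash each one; the
     *  resulting list of hashes is what the top-level manifest amounts to. */
    static List<byte[]> buildChunkHashes(byte[] fileData) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<byte[]> chunkHashes = new ArrayList<byte[]>();
        for (int off = 0; off < fileData.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, fileData.length - off);
            sha256.reset();
            sha256.update(fileData, off, len);
            chunkHashes.add(sha256.digest()); // one entry per chunk
        }
        return chunkHashes;
    }
}

The real metadata format is more involved, but the "list of per-chunk keys"
shape is the part that matters for the discussion below.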
> 3) The routing inside Freenet ensures, to some extent, that these chunks
> are stored close to other chunks that are close in KEYSPACE, not in
> semantic space.

Yes.

> 4) I assume this is what Sdiz meant by bittorrent queueing being
> advantageous: semantically close pieces of a file are stored on nodes
> identified by the tracker, and multiple pieces can be downloaded from the
> same node holding them, instead of searching for each individual piece and
> downloading it from its individual owner as is done in Freenet. But since
> in Freenet the chunks are not stored semantically close as in bittorrent,
> inefficiencies creep in while downloading them, even when the file is
> being downloaded from multiple sources.

Sure, but there are advantages too.

> Again you guys have to help me here: while downloading individual chunks
> of a file, if by polling for chunks Sdiz means that, say, a file has 1000
> chunks, then for the 1000 CHKs from the manifest file it sends a search
> request for EACH ONE of them INDIVIDUALLY and gets the data back as a
> result.

Yes.

> If this is the current algorithm, then I think it can be improved by first
> sorting the keys for the chunks of a file by their key values, so that
> when we group them later each individual group has keys close in key
> space. Then we divide the search into, say, 10 requests with 100 keys in
> each of them, which are close in key space due to the sort from the last
> step, and then send these 10 requests instead of sending the 1000
> requests.
>
> On receiving such a request a node will check to see if it has data for
> any of the keys in the packet; if yes, it will send the data item back to
> the requesting node, delete the entries for the data items it has sent
> from the group, and then forward the rest of the group for the remaining
> data items to be downloaded.
>
> The advantage this approach (10 requests) has over the previous (1000
> requests) approach is that we save the cost (time) of the forward path for
> multiple requests which end up reaching the same cluster in the network,
> as in this new approach there will be just a single request going to the
> cluster containing the set of required keys, compared to multiple requests
> in the previous approach.

The problem is that the requests will not actually reach the "same cluster":
because their keys are different, some of them would not have been routed in
the same way. Routing to the wrong nodes overwhelms any possible savings from
any other source IMHO. It also has major ramifications for load management:
the current load management code deals with individual requests and imposes
delays between them to limit the load. In terms of security, you win some,
you lose some - an attacker can see that the keys are related, but on the
other hand he gets fewer predecessor samples.

> Also certain nodes can be made responsible for getting part of the data
> and retrying till they get it, instead of the original node retrying it,
> which distributes the network bandwidth and the processing power required
> for the download across multiple nodes. This retry is more efficient, as
> it will be done by a node in the cluster close to the required data item
> rather than by the original node, which might be too far away to do it
> again.

This would be a big problem for load management.
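For reference, the grouping step proposed above would look roughly like the
sketch below. This is hypothetical code, not anything in the current tree;
in particular "location" stands in for a key's position in the circular
keyspace, however that is obtained.

// Hypothetical sketch of the proposed batching -- not existing Freenet code.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class BatchingSketch {

    /** A chunk key plus its position in the circular keyspace, in [0, 1). */
    static final class KeyLoc {
        final byte[] key;
        final double location;
        KeyLoc(byte[] key, double location) {
            this.key = key;
            this.location = location;
        }
    }

    /** Sort keys by keyspace location, then cut the sorted list into
     *  fixed-size batches, e.g. 1000 keys with batchSize 100 -> 10 bundles. */
    static List<List<KeyLoc>> groupIntoBatches(List<KeyLoc> keys, int batchSize) {
        List<KeyLoc> sorted = new ArrayList<KeyLoc>(keys);
        Collections.sort(sorted, new Comparator<KeyLoc>() {
            public int compare(KeyLoc a, KeyLoc b) {
                return Double.compare(a.location, b.location);
            }
        });
        List<List<KeyLoc>> batches = new ArrayList<List<KeyLoc>>();
        for (int i = 0; i < sorted.size(); i += batchSize) {
            int end = Math.min(i + batchSize, sorted.size());
            batches.add(new ArrayList<KeyLoc>(sorted.subList(i, end)));
        }
        return batches;
    }
}

The catch, as explained above, is that the keys inside one batch would still
not all be routed to the same nodes, so the bundle has to be split back up as
it is forwarded.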
> Please bear with me if I am making no sense, I just want to add value to
> the proposal instead of making a proposal just for the proposal's sake.
> Because otherwise the project is all about implementing a new UI for
> downloading files, and I see no scope for differentiation there.

A new UI for downloading files is completely unrelated to changing the
routing and request semantics.

> If you think it makes even a little sense then I can put it in the
> proposal and submit it to Google after getting a review from you guys.

To consider as a whole: IMHO it is important that routing decisions are made
for each key separately. The reason for this is that routing a key to the
wrong node will cause it either to not find the data or to take a lot more
hops to find it. It will cause more load overall, and generally cause
problems.

Right now our load management and request semantics deal with requests
individually. The request originator sends requests at a rate determined by
observing the round-trip times and the likelihood of timeouts and rejections
(which are propagated back to it); we decide whether to accept each request
individually, and back off when a node rejects requests (sketched below).
Requests are expected to be reasonably fast, are not queued on nodes (they
are answered, forwarded if possible, or they fail) and do not persist once
they have failed.

For security reasons, we avoid bursts, i.e. we aim for approximately constant
rates of transfer and requests, and hope nodes are up 24x7. This does of
course mean we cannot compete on performance with bursty networks such as
Perfect Dark (ignoring its minimum bandwidth and disk requirements!), but as
far as I can see it is unavoidable given our security goals: a big spike
routed across several hops will not only make it obvious to a traffic
observer that one particular node is responsible, but if the attacker has
compromised nodes close to that node, he can directly link the burst of
traffic from that node to the actual keys requested, i.e. completely
eliminate any anonymity on the requestor's side; and cryptographic tunnels
don't even solve this problem!

However, we could conceivably burst when starting requests (bulk requests,
not realtime requests), provided that the returning data is not bursted.
This would require new load management, but would allow us to forward
requests in large bundles, and thus improve efficiency, similar to what you
have suggested. My current thinking on this is:

- It clearly depends on the realtime vs bulk flag. See below. We need to mark
  requests as either realtime (forward asap, don't queue, short timeouts, low
  throughput in the sense of not accepting many of these), or bulk (queue
  until we have a good route, long timeouts, high throughput in the sense of
  accepting many requests).
- It also depends on some new form of load management, likely some variant on
  token passing (see other mails).
- Security-wise, starting lots of requests at once may give away that those
  requests are related, which may be useful if an attacker wants to trace
  stuff without knowing what it is, but generally an attacker that powerful
  will be able to break us in lots of other ways, so it may not make much
  difference. On the other hand, if requests can take a long time to
  complete, we are vulnerable to nodes "playing the system", locking up
  resources for a long time without answering anything.
- Low-risk or less paranoid connections could transfer data at higher rates,
  and ignore the security issues.
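To make the backoff point above concrete: per peer, the behaviour is
conceptually something like the following. This is a deliberately simplified,
hypothetical sketch (all names made up); the real peer and load-management
code tracks considerably more state.

// Simplified illustration of per-peer backoff on rejection.
final class PeerBackoffSketch {

    private long backoffUntil = 0;     // wall-clock ms until which we avoid this peer
    private int backoffLength = 1000;  // current backoff interval, ms
    private static final int MAX_BACKOFF = 3 * 60 * 60 * 1000; // arbitrary cap for the sketch

    /** The peer rejected a request: avoid it for a (growing) interval. */
    synchronized void onRejected(long now) {
        backoffLength = Math.min(backoffLength * 2, MAX_BACKOFF);
        backoffUntil = now + backoffLength;
    }

    /** The peer handled a request successfully: reset the interval. */
    synchronized void onSucceeded() {
        backoffLength = 1000;
    }

    /** Routing should skip this peer while it is backed off. */
    synchronized boolean isBackedOff(long now) {
        return now < backoffUntil;
    }
}

The important property for this discussion is that backoff, like the
accept/reject decision, is per peer and per request, not per bundle of keys.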
Major relevant future developments that we already have planned:

- Some indication of whether a request is realtime or bulk: if the former, it
  must complete quickly, even if that means routing is poor, and there will
  be a limited capacity for this type; if the latter, it may take a while to
  complete, but there will be accurate routing (perhaps including queueing),
  and high throughput because many of this type will be accepted
  simultaneously.
- Passive requests: a request fails (DNFs) and sticks around, waiting for the
  data it searched for. The request will reroute itself from time to time,
  depending on when we lose or gain connections, node locations change, and
  so on. The actual rerouting can generally be done in bundles, because there
  will be lots of passive requests pending and we can efficiently group them;
  but each request is still routed individually. This will need a new load
  management system, because a passive request clearly has ongoing costs, so
  we have a limited capacity for them that must be shared out sensibly and
  prioritised.
- Persistent requests: similar to passive requests, but the originator can
  actually disconnect, provided there is sufficient trust (probably related
  to darknet); when found, the data will trickle back towards the originator
  one hop at a time, waiting for reconnects where necessary. This makes
  darknets (pure friend-to-friend networks) composed of low-uptime nodes
  (i.e. laptops) feasible.

IMHO neither of the latter two items is suitable for a GSoC project; they are
core stuff and are large and challenging projects, currently planned for
0.10.
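To give a flavour of what the passive requests item would involve, a
per-request state machine roughly like the one below. This is entirely
hypothetical -- the feature is only planned -- so every name, state and
timing here is made up for illustration.

// Hypothetical sketch of a pending passive request -- not existing code.
final class PassiveRequestSketch {

    enum State { WAITING, ROUTED, COMPLETED }

    private final byte[] key;       // the key that originally DNFed
    private State state = State.WAITING;
    private long lastRoutedAt = 0;  // when we last (re)routed this request

    PassiveRequestSketch(byte[] key) {
        this.key = key;
    }

    /** Called when the topology changes (peer gained or lost, locations
     *  swapped): should this pending request be rerouted? Rerouting can be
     *  batched across many pending requests, but the route itself is still
     *  chosen per key. */
    synchronized boolean shouldReroute(long now, boolean topologyChanged) {
        return state == State.WAITING
                && topologyChanged
                && now - lastRoutedAt > 60 * 1000; // don't thrash: at most once a minute
    }

    synchronized void markRouted(long now) {
        state = State.ROUTED;
        lastRoutedAt = now;
    }

    /** The reroute DNFed again: go back to waiting for the data to appear. */
    synchronized void markDidNotFind() {
        state = State.WAITING;
    }

    /** The data finally showed up and has been returned. */
    synchronized void markCompleted() {
        state = State.COMPLETED;
    }
}

The load-management question is then how many of these a node is willing to
keep pending, and how that capacity is shared out and prioritised.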
> I appreciate all your help.
>
> Thanks and Regards,
> Jai Sharma
>
> On Tue, Mar 31, 2009 at 1:05 PM, Matthew Toseland <[email protected]> wrote:
> > On Monday 30 March 2009 09:21:36 Daniel Cheng wrote:
> > > Comment inline
> > >
> > > 2009/3/30 Jai Sharma <[email protected]>:
> > > > Dear Freenet Developer,
> > > >
> > > > Please find attached my proposal for the project:
> > > >
> > > > "Develop a File Sharing System, accessible from inside the Freenet
> > > > web interface"
> > > >
> > > > Please feel free to send back any suggestions.
> > >
> > > It would be easier to comment if you pasted the proposal inline (i.e.
> > > not as an attachment).
> >
> > Also, when you are happy with your proposal, you MUST submit it to the
> > Google SoC webapp: http://socghop.appspot.com/
> >
> > We cannot accept it from the mailing list.
> >
> > > > I propose the following implementation:
> > > > - Files with size less than a threshold will be downloaded directly,
> > > > in one piece, from the CHK.
> > > > - For other files the Content Hash Key (CHK), instead of identifying
> > > > the file itself, will identify the torrent file.
> > > > - The torrent file will have the list of CHKs for the pieces of the
> > > > bigger file, which will be downloaded simultaneously from multiple
> > > > peers to get higher download speeds.
> > >
> > > Before going into the details, I would like to check a few facts with
> > > you:
> > >
> > > 1) Currently: large files in CHK@ are split into smaller chunks; they
> > > are called "split files".
> > >
> > > 2) Split file chunks are distributed all over the network. Nobody owns
> > > the whole file.
> > >
> > > 3) Nor do you know who owns the chunks -- you have to keep polling
> > > from the location.
> > >
> > > 4) As a content-addressable network, files are downloaded from
> > > multiple nodes already.
> > >
> > > 5) One of the BT advantages is the tracker and queuing, which is
> > > impossible to implement (efficiently) without changing the routing
> > > details.
> > >
> > > So, does your proposal take care of the above facts?
> >
> > In general I agree with sdiz here: your proposal is a little unclear.
> > Are you expecting to have a central (or distributed) bittorrent tracker,
> > or are you simply implementing multi-source download *within Freenet*
> > (which we already have)?
