On Thursday 02 April 2009 01:46:08 Jai Sharma wrote:
> Dear Developers,
>
> Please excuse me for my ignorance in the routing matters in my earlier
> proposal. It's a busy time at the college and I didn't have time to go
> through all the Freenet papers, but that's totally my problem.
>
> But after helpful comments from Sdiz and Matthew I did go through whatever
> I could get my eyes on, and here is my current understanding of how CHKs
> work. I will need your help correcting me if I am wrong anywhere.
>
> As it stands:
>
> 1) CHKs are already split into 32KB chunks if the file size is greater
> than 1KB.
>
> 2) For files greater than 32KB, the CHK itself refers to the manifest
> file, which has a structure similar to a torrent file, in the sense that
> it is a list of hashes for the pieces of the file.

Yes.
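For a shared mental model, the splitfile/manifest idea in (1)-(2) boils down
to something like the sketch below. This is purely illustrative: the class
and method names are made up, real CHKs are not simply SHA-256 of the
plaintext chunk, and real splitfiles also carry FEC check blocks and other
metadata.

// Illustrative sketch only -- not Freenet's actual splitfile code.
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

final class ManifestSketch {

    static final int CHUNK_SIZE = 32 * 1024; // 32KB data blocks

    /** Split a file's bytes into 32KB chunks and hash each one; the
     *  resulting list of hashes is what the top-level manifest amounts to. */
    static List<byte[]> buildChunkHashes(byte[] fileData) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        List<byte[]> chunkHashes = new ArrayList<byte[]>();
        for (int off = 0; off < fileData.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, fileData.length - off);
            sha256.reset();
            sha256.update(fileData, off, len);
            chunkHashes.add(sha256.digest()); // one entry per chunk
        }
        return chunkHashes;
    }
}

The real metadata format is more involved, but the "list of per-chunk keys"
shape is the part that matters for the discussion below.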
> 3) The routing inside Freenet ensures, to some extent, that these chunks
> are stored close to other chunks that are close in KEYSPACE, not in
> semantic space.

Yes.

> 4) I assume this is what Sdiz meant by bittorrent queueing being
> advantageous: semantically close pieces of a file are stored on nodes
> identified by the tracker, and multiple pieces can be downloaded from the
> same node holding them, instead of searching for each individual piece and
> downloading it from its individual owner as is done in Freenet. But since
> in Freenet the chunks are not stored semantically close as in bittorrent,
> inefficiencies creep in while downloading them, even when the file is
> being downloaded from multiple sources.

Sure, but there are advantages too.

> Again you guys have to help me here: while downloading individual chunks
> of a file, if by polling for chunks Sdiz means that, say, a file has 1000
> chunks, then for the 1000 CHKs from the manifest file it sends a search
> request for EACH ONE of them INDIVIDUALLY and gets the data back as a
> result.

Yes.

> If this is the current algorithm, then I think it can be improved by first
> sorting the keys for the chunks of a file by their key values, so that
> when we group them later each individual group has keys close in key
> space. Then we divide the search into, say, 10 requests with 100 keys in
> each of them, which are close in key space due to the sort from the last
> step, and then send these 10 requests instead of sending the 1000
> requests.
>
> On receiving such a request a node will check to see if it has data for
> any of the keys in the packet; if yes, it will send the data item back to
> the requesting node, delete the entries for the data items it has sent
> from the group, and then forward the rest of the group for the remaining
> data items to be downloaded.
>
> The advantage this approach (10 requests) has over the previous (1000
> requests) approach is that we save the cost (time) of the forward path for
> multiple requests which end up reaching the same cluster in the network,
> as in this new approach there will be just a single request going to the
> cluster containing the set of required keys, compared to multiple requests
> in the previous approach.

The problem is that the requests will not actually reach the "same cluster":
because their keys are different, some of them would not have been routed in
the same way. Routing to the wrong nodes overwhelms any possible savings from
any other source IMHO. It also has major ramifications for load management:
the current load management code deals with individual requests and imposes
delays between them to limit the load. In terms of security, you win some,
you lose some - an attacker can see that the keys are related, but on the
other hand he gets fewer predecessor samples.

> Also certain nodes can be made responsible for getting part of the data
> and retrying till they get it, instead of the original node retrying it,
> which distributes the network bandwidth and the processing power required
> for the download across multiple nodes. This retry is more efficient, as
> it will be done by a node in the cluster close to the required data item
> rather than by the original node, which might be too far away to do it
> again.

This would be a big problem for load management.
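For reference, the grouping step proposed above would look roughly like the
sketch below. This is hypothetical code, not anything in the current tree;
in particular "location" stands in for a key's position in the circular
keyspace, however that is obtained.

// Hypothetical sketch of the proposed batching -- not existing Freenet code.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class BatchingSketch {

    /** A chunk key plus its position in the circular keyspace, in [0, 1). */
    static final class KeyLoc {
        final byte[] key;
        final double location;
        KeyLoc(byte[] key, double location) {
            this.key = key;
            this.location = location;
        }
    }

    /** Sort keys by keyspace location, then cut the sorted list into
     *  fixed-size batches, e.g. 1000 keys with batchSize 100 -> 10 bundles. */
    static List<List<KeyLoc>> groupIntoBatches(List<KeyLoc> keys, int batchSize) {
        List<KeyLoc> sorted = new ArrayList<KeyLoc>(keys);
        Collections.sort(sorted, new Comparator<KeyLoc>() {
            public int compare(KeyLoc a, KeyLoc b) {
                return Double.compare(a.location, b.location);
            }
        });
        List<List<KeyLoc>> batches = new ArrayList<List<KeyLoc>>();
        for (int i = 0; i < sorted.size(); i += batchSize) {
            int end = Math.min(i + batchSize, sorted.size());
            batches.add(new ArrayList<KeyLoc>(sorted.subList(i, end)));
        }
        return batches;
    }
}

The catch, as explained above, is that the keys inside one batch would still
not all be routed to the same nodes, so the bundle has to be split back up as
it is forwarded.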
> Please bear with me if I am making no sense, I just want to add value to
> the proposal instead of making a proposal just for the proposal's sake.
> Because otherwise the project is all about implementing a new UI for
> downloading files, and I see no scope for differentiation there.

A new UI for downloading files is completely unrelated to changing the
routing and request semantics.

> If you think it makes even a little sense then I can put it in the
> proposal and submit it to Google after getting a review from you guys.

To consider as a whole: IMHO it is important that routing decisions are made
for each key separately. The reason for this is that routing a key to the
wrong node will cause it either to not find the data or to take a lot more
hops to find it. It will cause more load overall, and generally cause
problems.

Right now our load management and request semantics deal with requests
individually. The request originator sends requests at a rate determined by
observing the round-trip times and the likelihood of timeouts and rejections
(which are propagated back to it); we decide whether to accept each request
individually, and back off when a node rejects requests (sketched below).
Requests are expected to be reasonably fast, are not queued on nodes (they
are answered, forwarded if possible, or they fail) and do not persist once
they have failed.

For security reasons, we avoid bursts, i.e. we aim for approximately constant
rates of transfer and requests, and hope nodes are up 24x7. This does of
course mean we cannot compete on performance with bursty networks such as
Perfect Dark (ignoring its minimum bandwidth and disk requirements!), but as
far as I can see it is unavoidable given our security goals: a big spike
routed across several hops will not only make it obvious to a traffic
observer that one particular node is responsible, but if the attacker has
compromised nodes close to that node, he can directly link the burst of
traffic from that node to the actual keys requested, i.e. completely
eliminate any anonymity on the requestor's side; and cryptographic tunnels
don't even solve this problem!

However, we could conceivably burst when starting requests (bulk requests,
not realtime requests), provided that the returning data is not bursted.
This would require new load management, but would allow us to forward
requests in large bundles, and thus improve efficiency, similar to what you
have suggested. My current thinking on this is:

- It clearly depends on the realtime vs bulk flag. See below. We need to mark
  requests as either realtime (forward asap, don't queue, short timeouts, low
  throughput in the sense of not accepting many of these), or bulk (queue
  until we have a good route, long timeouts, high throughput in the sense of
  accepting many requests).
- It also depends on some new form of load management, likely some variant on
  token passing (see other mails).
- Security-wise, starting lots of requests at once may give away that those
  requests are related, which may be useful if an attacker wants to trace
  stuff without knowing what it is, but generally an attacker that powerful
  will be able to break us in lots of other ways, so it may not make much
  difference. On the other hand, if requests can take a long time to
  complete, we are vulnerable to nodes "playing the system", locking up
  resources for a long time without answering anything.
- Low-risk or less paranoid connections could transfer data at higher rates,
  and ignore the security issues.
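To make the backoff point above concrete: per peer, the behaviour is
conceptually something like the following. This is a deliberately simplified,
hypothetical sketch (all names made up); the real peer and load-management
code tracks considerably more state.

// Simplified illustration of per-peer backoff on rejection.
final class PeerBackoffSketch {

    private long backoffUntil = 0;     // wall-clock ms until which we avoid this peer
    private int backoffLength = 1000;  // current backoff interval, ms
    private static final int MAX_BACKOFF = 3 * 60 * 60 * 1000; // arbitrary cap for the sketch

    /** The peer rejected a request: avoid it for a (growing) interval. */
    synchronized void onRejected(long now) {
        backoffLength = Math.min(backoffLength * 2, MAX_BACKOFF);
        backoffUntil = now + backoffLength;
    }

    /** The peer handled a request successfully: reset the interval. */
    synchronized void onSucceeded() {
        backoffLength = 1000;
    }

    /** Routing should skip this peer while it is backed off. */
    synchronized boolean isBackedOff(long now) {
        return now < backoffUntil;
    }
}

The important property for this discussion is that backoff, like the
accept/reject decision, is per peer and per request, not per bundle of keys.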
Major relevant future developments that we already have planned:

- Some indication of whether a request is realtime or bulk: if the former, it
  must complete quickly, even if that means routing is poor, and there will
  be a limited capacity for this type; if the latter, it may take a while to
  complete, but there will be accurate routing (perhaps including queueing),
  and high throughput because many of this type will be accepted
  simultaneously.
- Passive requests: a request fails (DNFs) and sticks around, waiting for the
  data it searched for. The request will reroute itself from time to time,
  depending on when we lose or gain connections, node locations change, and
  so on. The actual rerouting can generally be done in bundles, because there
  will be lots of passive requests pending and we can efficiently group them;
  but each request is still routed individually. This will need a new load
  management system, because a passive request clearly has ongoing costs, so
  we have a limited capacity for them that must be shared out sensibly and
  prioritised.
- Persistent requests: similar to passive requests, but the originator can
  actually disconnect, provided there is sufficient trust (probably related
  to darknet); when found, the data will trickle back towards the originator
  one hop at a time, waiting for reconnects where necessary. This makes
  darknets (pure friend-to-friend networks) composed of low-uptime nodes
  (i.e. laptops) feasible.

IMHO neither of the latter two items is suitable for a GSoC project; they are
core stuff and are large and challenging projects, currently planned for
0.10.
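To give a flavour of what the passive requests item would involve, a
per-request state machine roughly like the one below. This is entirely
hypothetical -- the feature is only planned -- so every name, state and
timing here is made up for illustration.

// Hypothetical sketch of a pending passive request -- not existing code.
final class PassiveRequestSketch {

    enum State { WAITING, ROUTED, COMPLETED }

    private final byte[] key;       // the key that originally DNFed
    private State state = State.WAITING;
    private long lastRoutedAt = 0;  // when we last (re)routed this request

    PassiveRequestSketch(byte[] key) {
        this.key = key;
    }

    /** Called when the topology changes (peer gained or lost, locations
     *  swapped): should this pending request be rerouted? Rerouting can be
     *  batched across many pending requests, but the route itself is still
     *  chosen per key. */
    synchronized boolean shouldReroute(long now, boolean topologyChanged) {
        return state == State.WAITING
                && topologyChanged
                && now - lastRoutedAt > 60 * 1000; // don't thrash: at most once a minute
    }

    synchronized void markRouted(long now) {
        state = State.ROUTED;
        lastRoutedAt = now;
    }

    /** The reroute DNFed again: go back to waiting for the data to appear. */
    synchronized void markDidNotFind() {
        state = State.WAITING;
    }

    /** The data finally showed up and has been returned. */
    synchronized void markCompleted() {
        state = State.COMPLETED;
    }
}

The load-management question is then how many of these a node is willing to
keep pending, and how that capacity is shared out and prioritised.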
> I appreciate all your help.
>
> Thanks and Regards,
> Jai Sharma
>
> On Tue, Mar 31, 2009 at 1:05 PM, Matthew Toseland <[email protected]> wrote:
> > On Monday 30 March 2009 09:21:36 Daniel Cheng wrote:
> > > Comment inline
> > >
> > > 2009/3/30 Jai Sharma <[email protected]>:
> > > > Dear Freenet Developer,
> > > >
> > > > Please find attached my proposal for the project:
> > > >
> > > > "Develop a File Sharing System, accessible from inside the Freenet
> > > > web interface"
> > > >
> > > > Please feel free to send back any suggestions.
> > >
> > > It would be easier to comment if you pasted the proposal inline (i.e.
> > > not as an attachment).
> >
> > Also, when you are happy with your proposal, you MUST submit it to the
> > Google SoC webapp: http://socghop.appspot.com/
> >
> > We cannot accept it from the mailing list.
> >
> > > > I propose the following implementation:
> > > > - Files with size less than a threshold will be downloaded directly,
> > > > in one piece, from the CHK.
> > > > - For other files the Content Hash Key (CHK), instead of identifying
> > > > the file itself, will identify the torrent file.
> > > > - The torrent file will have the list of CHKs for the pieces of the
> > > > bigger file, which will be downloaded simultaneously from multiple
> > > > peers to get higher download speeds.
> > >
> > > Before going into the details, I would like to check a few facts with
> > > you:
> > >
> > > 1) Currently: large files in CHK@ are split into smaller chunks; they
> > > are called "split files".
> > >
> > > 2) Split file chunks are distributed all over the network. Nobody owns
> > > the whole file.
> > >
> > > 3) Nor do you know who owns the chunks -- you have to keep polling
> > > from the location.
> > >
> > > 4) As a content-addressable network, files are downloaded from
> > > multiple nodes already.
> > >
> > > 5) One of the BT advantages is the tracker and queuing, which is
> > > impossible to implement (efficiently) without changing the routing
> > > details.
> > >
> > > So, does your proposal take care of the above facts?
> >
> > In general I agree with sdiz here: your proposal is a little unclear.
> > Are you expecting to have a central (or distributed) bittorrent tracker,
> > or are you simply implementing multi-source download *within Freenet*
> > (which we already have)?
