On Wednesday 22 May 2013 15:32:03 leuchtkaefer wrote:
>
> The ideas I have after reading your email:
>
> I see there is a need to work on Spider, WoT, and Library.
>
> 1) Adapt Spider to:
> - Read announcements from WoT too. Announcements of new freesites will be
> tied to the announcing identity (besides, a node can opt to announce in
> forums or other channels). This will help to control spamming identities.
> - We need a policy for assigning Freenet URIs to crawlers. Do we have one?
> For example, a node that volunteers to crawl Freenet documents might only
> crawl documents whose keys are close to the node's location ("how close"
> being a function of the number of crawlers). The rationale is that
> bandwidth is reduced and each crawler specializes in one key area (see the
> sketch below).
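>
> A minimal sketch of that closeness policy, assuming node locations live on
> Freenet's circular [0,1) keyspace; the class and method names here are
> hypothetical, not part of any existing Spider API:
>
>     // Decide whether this node should crawl a given key, based on
>     // circular distance in the keyspace and the number of crawlers.
>     public final class CrawlAssignment {
>         /** Circular distance between two locations on the [0,1) ring. */
>         static double distance(double a, double b) {
>             double d = Math.abs(a - b);
>             return Math.min(d, 1.0 - d);
>         }
>
>         /** True if this node should crawl the document at keyLocation. */
>         static boolean shouldCrawl(double nodeLocation, double keyLocation,
>                 int numCrawlers) {
>             // Each crawler covers roughly 1/numCrawlers of the ring,
>             // half of it on each side of its own location.
>             double threshold = 0.5 / numCrawlers;
>             return distance(nodeLocation, keyLocation) <= threshold;
>         }
>     }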
>
>
> 2) Library:
> The file system layer for each node is provided by Library. Each node will
> have an index that permits easy access to its own files and to files shared
> with it by others. The index is a top-layer tree composed of multiple
> subindexes, i.e. one for each identity plus one for the root. They will only
> contain manifests of the files, not all the CHKs corresponding to the small
> data blocks that make up large files. If the file is small (no splitfile),
> the index can contain the CHK directly.
> Not sure if hard links exist in Freenet, but I think it is possible to make
> two distinct SSKs pointing to the same file, signed with different
> identities. Then, if the same file is shared under two different identities,
> both corresponding subindexes will contain an SSK pointing to that file.
> All keys under the root are local files that are not shared.
> Keys under a WoT identity can be remote (shared with me) or local (shared by
> me). Assuming that WoT relationships are not reciprocal, only the local keys
> can be shared, i.e. I am not allowed to share with others a document that
> was shared with me. Not sure if current Freenet provides such controls. A
> sketch of this index layout follows below.
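>
> A minimal sketch of that layout, assuming plain strings stand in for real
> Freenet URIs; the class names are hypothetical, not an existing Library API:
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // Top-layer index: one subindex per identity plus a private root.
>     public final class FilesharingIndex {
>         /** Manifest entries: file name -> top key (CHK for small
>          *  files, SSK for files shared under an identity). */
>         static final class Subindex {
>             final Map<String, String> entries = new HashMap<>();
>         }
>
>         // Private root for local files that are not shared (never uploaded).
>         final Subindex root = new Subindex();
>         // One subindex per WoT identity.
>         final Map<String, Subindex> perIdentity = new HashMap<>();
>
>         /** Share a file under an identity by adding an SSK signed by it. */
>         void share(String identityId, String fileName, String sskUri) {
>             perIdentity.computeIfAbsent(identityId, k -> new Subindex())
>                        .entries.put(fileName, sskUri);
>         }
>     }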
>
> 3) WoT: adapt the WoT to include the root of the filesharing subindex in each
> identity. Not having a filesharing index means not sharing anything under
> that identity.
>
> Example:
> Suppose I run a node with 5 different identities (tree below). Each
> identity may be associated with different groups (PSKs) in Freenet. id1 is
> my first identity (obtained via the "seed identity"). If I volunteer to run
> a crawler for all public documents, I will publish the key to that
> filesharing index under id1. Notice that this is the only case in which I
> publish an index with remote keys (it is assumed that all crawled documents
> under this identity are "public").
> Then suppose that I am a member of the Debian developers group using
> identity id2, and I also belong to two of its subgroups, the Debian Kernel
> Team using id4 and the Debian OpenSSL Team using id5. If I want to share a
> file only with the Debian Kernel subgroup, I announce the file by uploading
> the SSK (associated with id4) into that WoT subindex. If I want to share a
> file with all members of Debian, I use id2, and if I don't want to share
> the file at all, I add it to my private root (which is never uploaded).
>
>             node (me)
>         ________|________
>         |       |       |
> (public) id1   id2     id3
>               __|__
>               |   |
>              id4 id5
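>
> Continuing the hypothetical FilesharingIndex sketch above, choosing the
> identity according to the intended audience (all key strings are
> placeholders):
>
>     public class Example {
>         public static void main(String[] args) {
>             FilesharingIndex index = new FilesharingIndex();
>             // Public crawl index, published under id1:
>             index.share("id1", "crawled.html", "SSK@<id1-key>/crawled.html");
>             // Visible to all Debian developers:
>             index.share("id2", "patch.diff", "SSK@<id2-key>/patch.diff");
>             // Only the Debian Kernel subgroup:
>             index.share("id4", "kernel.diff", "SSK@<id4-key>/kernel.diff");
>             // Not shared at all: goes in the private root (never uploaded).
>             index.root.entries.put("draft.txt", "CHK@<content-key>");
>         }
>     }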
>
> Is such a thing possible to implement? What do you think? I would like to
> update my GSoC proposal.
>
> Best,
> leuchtkaefer
>
Partly. There is still some confusion here.

Currently, indexes don't "contain" SSKs: an index consists of a USK/SSK at the top, which points to btrees. The btree nodes are stored as CHKs. They contain pointers (URIs) to the indexed files, which can of course be SSKs.

When a file is uploaded, we either upload it as a CHK or as an SSK. Ultimately every large file is split up into CHKs, but they are treated differently depending on whether there is a CHK or an SSK at the top: if we upload as a CHK, the encryption key for the file is derived from the content of the file, so anyone uploading the same file will get the same set of blocks. (The actual top key also depends on the MIME type and filename, but this is a bug we plan to fix eventually.) This is ideal for filesharing. Unfortunately there are security issues with uploading predictable blocks, so if we upload as an SSK, we use a random encryption key, resulting in a different set of CHKs. For security reasons this is recommended.

We agree that we should have one index per WoT identity. The same person can have many different identities, some for different roles and some completely separate; generally we shouldn't be able to tell that two identities belong to the same person unless they tell us. WoT identities have a trust level between 0 and 100 (0 being spammer) for every known identity. This is either set explicitly (in which case it may or may not be published) or is computed from others' published trust levels. It is unidirectional.

Don't worry about PSKs yet. They don't exist yet, and making them work well will be a lot of work on much lower layers than this stuff.

IMHO the key functionality here is to define an index format for filesharing, implement some UI for maintaining your per-identity index, and a tool to search the indexes of all the WoT identities we know about that have a high enough trust level. Once that core functionality is sorted out, there's a lot that can be done to optimise it further and make it scale better: Bloom filters, merging indexes, optimising the tree (e.g. b*-tree), possibly avoiding the tree for really popular terms, and so on.

I suggest the format should be:
- Top level: a USK that points to the tree.
- Tree level: a btree, similar to that used by Library.
- Within a single term there are many single files; Spider uses a sub-tree to sort them by relevance. I'm not sure that will work for filesharing; maybe it's just a list.
- Single file: the URI for the file (CHK, SSK, etc.), the file size and hashes (these are easily extracted from the file without downloading it fully), and maybe other stuff like a description or a link to a thumbnail or preview. Also the public key of the original uploader's WoT identity, and a signature (for anti-spam purposes, e.g. so we can still filter out blacklisted uploaders when we're using a merged index). A sketch of such an entry follows at the end of this mail.

If you are interested in distributed spidering (for HTML), that would also be a great project. There is somebody else working on it, but I don't think much has been seen from him for a while. He does have a separate search index system for HTML, which is more efficient but less scalable IIRC. IMHO a good filesharing search system is more urgent.
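A minimal sketch of such a per-file entry, under stated assumptions: the
field names and the trust-filtering helper are hypothetical, not an existing
Library or WoT API:

    import java.nio.ByteBuffer;
    import java.util.Map;

    // One entry in a per-identity filesharing index.
    public final class FileEntry {
        String uri;            // CHK@..., SSK@..., etc.
        long size;             // file size in bytes
        byte[] sha256;         // content hash, extractable without a full download
        String description;    // optional metadata
        String thumbnailUri;   // optional link to a preview
        byte[] uploaderPubKey; // public key of the original uploader's WoT identity
        byte[] signature;      // signature over the fields above

        /** Keep only entries whose uploader has a high enough WoT trust
         *  level (0..100), e.g. when searching a merged index. */
        static boolean acceptable(FileEntry e, Map<ByteBuffer, Integer> trust,
                int minTrust) {
            Integer t = trust.get(ByteBuffer.wrap(e.uploaderPubKey));
            return t != null && t >= minTrust;
        }
    }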
