Re: [freenet-dev] questions about Library for my GSoC project

leuchtkaefer Wed, 22 May 2013 07:32:28 -0700


The ideas I have after reading your email:


I see there is a need to work on Spider, WoT and Library.

1) Adapt Spider to:
- Read announcements from WoT too. The announcement of a new freesite will be 
connected to an identity (besides, a node can opt to announce in forums or 
other channels). This optimization will help to control spamming identities. 
- We need a policy to assign Freenet URIs to the crawlers. Do we have that? For 
example, a node that volunteers to crawl Freenet documents might only crawl 
documents whose keys are close to the node ("how close" is a function of the 
number of crawlers). The rationale is that bandwidth use is reduced and each 
crawler specializes in one area of the keyspace.
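
A minimal sketch of the assignment policy above. It assumes that every key and 
every crawler node maps to a location on the circular [0, 1) keyspace, and that 
each of n crawlers takes the keys within an arc of width 1/n centred on its own 
location. The function names and the threshold rule are hypothetical, not 
existing Spider code:

```python
def circular_distance(a: float, b: float) -> float:
    """Distance between two locations on the circular [0, 1) keyspace."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def should_crawl(key_location: float, node_location: float, n_crawlers: int) -> bool:
    """Hypothetical policy: a crawler takes a key if the key lies within an
    arc of width 1/n_crawlers centred on the crawler's own location."""
    if n_crawlers < 1:
        raise ValueError("need at least one crawler")
    return circular_distance(key_location, node_location) <= 0.5 / n_crawlers


# Example: with 4 crawlers, a node at 0.05 takes a key at 0.95 (wrap-around)
# but not a key at 0.5.
print(should_crawl(0.95, 0.05, 4), should_crawl(0.5, 0.05, 4))
```

With a single crawler the arc covers the whole ring. With several crawlers the 
arcs only cover the keyspace if the crawler locations are spread evenly, so a 
real policy would need some coordination between crawlers.
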

2) Library:
The file system layer for each node is provided by Library. Each node will have 
an index that permits easy access to its own files and to files shared with 
that node by others. The index is a top-layer tree composed of multiple 
subindexes, i.e. one for each identity and one for the root. They will only 
contain the manifest of each file, not every CHK of the small data blocks that 
make up a large file. If the file is small (no splitfile), the index can 
contain the CHK directly. 
I am not sure whether hard links exist in Freenet. But I think it is possible 
to make two distinct SSKs that point to the same file and are signed with 
different identities. Then, if the same file is shared under two different 
identities, both corresponding subindexes will contain an SSK pointing to that 
file.  
Keys under the root refer to local files that are not shared.
Keys under a WoT identity can be remote (shared with me) or local (shared by 
me). Assuming that WoT relationships are not reciprocal, only the local keys 
can be shared, i.e. I am not allowed to share with others a document that was 
shared with me. I am not sure whether Freenet currently provides such controls. 

3) WoT: adapt the WoT to include the root of the filesharing subindex in each 
identity. Not having a filesharing index means not sharing anything under that 
identity.

Example:
Suppose I run a node with 5 different identities (tree below). Each 
identity may be associated with different groups (PSK) in Freenet. id1 is my 
first identity (obtained through the "seed identity"). If I volunteer to run a 
crawler for all public documents, I will publish the key to that filesharing 
index under id1. Note that this is the only case in which I publish an index 
with remote keys (it is assumed that all crawled documents under this identity 
are "public").
Then suppose that I am a member of the group Debian developers using identity 
id2, and I also belong to two subgroups (Debian OpenSSL Team and Debian 
Kernel Team) using different identities: id4 and id5. Basically, if I want to 
share a file only with the subgroup Debian Kernel, I announce the file by 
uploading the SSK (associated with id4) in my WoT subindex. If I want to share 
a file with all members of Debian, I use id2, and if I don't want to share the 
file, I add it to my private root (not uploaded).  

                    node (me)
            ____________|____________
            |           |           |
  (public) id1         id2         id3
                      __|__
                     |     |
                    id4   id5

Is such a thing possible to implement? What do you think? I would like to 
update my GSoC proposal.

Best,
leuchtkaefer


>________________________________
> From: Matthew Toseland <[email protected]>
>To: [email protected]; leuchtkaefer <[email protected]> 
>Sent: Wednesday, May 22, 2013 12:46 PM
>Subject: Re: [freenet-dev] questions about Library for my GSoC project
> 
>
>On Wednesday 22 May 2013 09:30:03 leuchtkaefer wrote:
>> 
>> Hi Ximin,
>> 
>> I am not confusing the Library index with the low level block datastore. 
>> But I still have doubts regarding Library :/  At first, I thought the Library 
>> index will act as an inverted index. But this is contrary to Freenet 
>> anonymity goals.
>> 
>> I read that the index contains the utab tree (utab: BTreeMap<URIKey, 
>> BTreeMap<FreenetURI, URIEntry>>) and the freenet uri contains the routing 
>> key:
>> freenet:[KeyType@]RoutingKey,CryptoKey[,n1=v1,n2=v2,...][/docname][/metastring]
>> 
>> Then it is not possible to share the index? 
>> 
>> Besides, only some users run the new spider so I guess only some users have 
>> local indexes and publish some part on Freenet.
>
>A lot of the below will be explaining stuff that you may already understand, 
>please bear with me and ask for clarification on anything you don't follow. 
>Thanks...
>
>An index indexes only the content that has been put into it, not all data on 
>Freenet. With the current code, Spider does the indexing, by following links 
>from one page to another, like an internet search engine. We are simply 
>building an index on top of a block store. It does not include everything on 
>Freenet.
>
>Currently Spider doesn't even use the Web of Trust, so announcing new 
>freesites has to be done by other means, e.g. forums. Once some freesite links 
>to them, Spider will pick them up eventually when it sees that that freesite 
>has updated. We would like Spider to pick up site announcements from forums 
>and/or from the Web of Trust.
>
>For filesharing, we probably want the index to only contain files people have 
>added. For example, we could link an upload to our Web of Trust identity, so 
>when the upload finishes, instead of manually posting it on a forum (or 
>linking it from a freesite), we also automatically add it to our filesharing 
>index, which is exposed through the Web of Trust, and can be searched (or 
>merged) by anyone who has a high enough trust level for our WoT identity. Web 
>of Trust itself is another high-level structure: One node can have many WoT 
>identities or no WoT identities, it's all built from keys.
>> 
>> Who has access to that on-Freenet index? A group of users (PSK), or is it 
>> public for any Freenet user (I guess not)? 
>
>Indexes, like anything else on Freenet, are visible to anyone who has the key 
>(in the form of a URL). The top level of an index is currently a USK.
>
>Freenet itself provides only very limited functionality:
>- Fetch a key (as a single file).
>- Insert a single file to a key.
>- Insert a bundle of keys as a "freesite".
>
>There are 4 types of keys:
>- CHK: Content hash key. URI depends on the (encrypted) content.
>- SSK: Signed subspace key. Belongs to a cryptographic identity, so the URI 
>consists of the hash of the public key, and a filename.
>- USK: Updatable subspace key. A messy hack to provide an updatable key based 
>on SSKs. URI consists of the hash of the public key, a filename, and a version 
>number.
>- KSK: Keyword signed key. SSK where the public/private keypair is derived 
>from the keyword. URI consists of a keyword, i.e. KSK@ followed by the keyword.
>
>There is more detail on these on the wiki.
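
To make the key types above concrete, here is a rough sketch that splits a URI 
into its key type, its key material and its path, following the general shape 
freenet:KeyType@.../docname quoted earlier in this thread. The parser, its 
names and the placeholder strings are my own assumptions, not the FreenetURI 
class from the Freenet source, and it does not validate the key material:

```python
from typing import List, NamedTuple

KEY_TYPES = {"CHK", "SSK", "USK", "KSK"}


class ParsedURI(NamedTuple):
    key_type: str    # one of KEY_TYPES
    key_part: str    # everything between "@" and the first "/"
    path: List[str]  # remaining path components (docname, version, ...)


def parse_freenet_uri(uri: str) -> ParsedURI:
    """Split a URI of the form [freenet:]KeyType@key_part[/path...]."""
    if uri.startswith("freenet:"):
        uri = uri[len("freenet:"):]
    key_type, sep, rest = uri.partition("@")
    if not sep or key_type not in KEY_TYPES:
        raise ValueError(f"unknown or missing key type in {uri!r}")
    key_part, _, path = rest.partition("/")
    return ParsedURI(key_type, key_part, [p for p in path.split("/") if p])


# "routingkey,cryptokey" is a placeholder, not a real key.
print(parse_freenet_uri("freenet:USK@routingkey,cryptokey/mysite/5"))
```
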
>
>Everything else is built on top of this at a higher layer, including the Web 
>of Trust, forums/microblogging (FMS, Sone, Freetalk), and searching.
>> 
>> I understand that a user can effectively be the owner of one part/branch of 
>> the top-layer structure and update/modify/delete its part (COW). A top-layer 
>> structure is the "overall vision" of one user, composed of pieces published 
>> by multiple users. A top-layer structure is always local (but some subtrees 
>> are links to on-Freenet structures).
>
>At present, a user owns an index, and can add to that index. The top level USK 
>includes CHKs pointing to the two trees (by term and by uri). The trees 
>themselves are made of CHKs, so changing a node in the tree requires updating 
>all the nodes above it. But the advantage is that you only need to update 
>those nodes, rather than re-upload the whole tree.
>
>The COW structure means that, in principle, a different user could add to that 
>index in the same manner, creating their own new tree (with new CHK root(s)). 
>This does not affect the first tree, but it shares most of its content with 
>the first tree. And as above, it's relatively cheap, we only need to update 
>the nodes that are changed and all their parents up the tree. The existing 
>code in Library supports this in principle but there is no UI for it at 
>present: Spider uses Library only to update its own index.
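
The copy-on-write behaviour described above can be sketched with a toy 
content-addressed tree. This is only an illustration under my own assumptions: 
a Python dict stands in for Freenet, a SHA-256 hex digest stands in for a CHK, 
and the node layout ("children" and "entries") is hypothetical, not Library's 
actual B-tree format:

```python
import hashlib
import json


def put(store: dict, node: dict) -> str:
    """Store a node under the hash of its content (stand-in for a CHK insert)."""
    blob = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()
    store[key] = blob
    return key


def get(store: dict, key: str) -> dict:
    """Fetch and decode a node; always returns a fresh copy."""
    return json.loads(store[key])


def insert(store: dict, root_key, path: list, value) -> str:
    """Return the key of a new root with `value` set at `path`.

    Only the nodes along `path` are re-written; the old root and every
    untouched subtree stay in the store and are shared by both trees."""
    if root_key is None:
        node = {"children": {}, "entries": {}}
    else:
        node = get(store, root_key)
    head, *tail = path
    if not tail:
        node["entries"][head] = value
    else:
        child_key = node["children"].get(head)
        node["children"][head] = insert(store, child_key, tail, value)
    return put(store, node)


store = {}
root1 = insert(store, None, ["a", "x"], 1)
root1 = insert(store, root1, ["b", "y"], 2)
root2 = insert(store, root1, ["a", "z"], 3)  # a second user's addition
# Both roots remain readable, and the untouched "b" subtree is shared.
print(get(store, root1)["children"]["b"] == get(store, root2)["children"]["b"])
```

Adding the entry under "a" writes two new nodes (the changed node and the 
root); the "b" subtree is referenced by both roots without being uploaded again.
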
>
>When searching, Library (the search box on the homepage) will happily search 
>multiple indexes at once and return the merged results from all of them. 
>That's the limit of the "local overall vision". Individual indexes are 
>normally published so that other people can use them, since running Spider is 
>fairly heavy.
>> 
>> I don't understand why data blocks were included in the index. Does that 
>> mean the index contains another replica of the data? If that is the case, 
>> it is necessary to replace the B-trees with B+ trees, as was previously 
>> suggested, to remove data and reduce index size.
>
>We do not include *data* in the index. What we do include is:
>- A map from terms (that is, keywords) to the pages including those terms, 
>ranked by relevance, and including the location of the word within the page 
>(i.e. the number of words before that word).
>- A map from URLs to basic metadata (title etc).
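
The two maps listed above can be sketched as plain dictionaries. This is a toy 
in-memory version under my own assumptions (whitespace tokenisation, relevance 
approximated by occurrence count, hypothetical function names); the real index 
stores these maps as trees on Freenet:

```python
from collections import defaultdict


def build_index(pages: dict):
    """pages maps uri -> (title, text). Returns (term_map, uri_map).

    term_map: term -> {uri: [word positions of the term in that page]}
    uri_map:  uri  -> basic metadata (only the title here)
    """
    term_map = defaultdict(dict)
    uri_map = {}
    for uri, (title, text) in pages.items():
        uri_map[uri] = {"title": title}
        for position, word in enumerate(text.lower().split()):
            term_map[word].setdefault(uri, []).append(position)
    return dict(term_map), uri_map


def search(term_map: dict, term: str) -> list:
    """URIs containing `term`, most occurrences first (a crude relevance)."""
    hits = term_map.get(term.lower(), {})
    return sorted(hits, key=lambda uri: (-len(hits[uri]), uri))


pages = {
    "uri-1": ("Page one", "freenet index freenet"),
    "uri-2": ("Page two", "another index"),
}
term_map, uri_map = build_index(pages)
print(search(term_map, "freenet"), term_map["freenet"])
```

Note that the maps hold only positions and metadata, never the page content 
itself.
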
>> 
>> I am also thinking about how to apply Bloom filters to the on-Freenet 
>> index. I didn't check in detail what the current support for Bloom filters 
>> inside Freenet is. Initially, I understood that Bloom filters are applied 
>> to one-hop file requests, meaning that Bloom filters are shared with 
>> neighbors. 
>
>No, at the moment the only use of Bloom filters is to optimise the datastore. 
>This is not relevant to search. Search is at least two layers above the 
>datastore.
>
>The proposal is that an index should link to a bloom filter of all the terms 
>in that index. So then when we do a search over multiple indexes, we can check 
>the bloom filter first (which we will have pre-loaded, so this is instant), 
>and identify that most of the indexes don't contain the term (word) we are 
>looking for. Then we only need to search those that might contain that word.
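
A minimal sketch of the proposal above: each index carries a Bloom filter of 
its terms, and a search consults the pre-loaded filters first to decide which 
indexes are worth fetching. The class, the parameters (1024 bits, 3 hashes) and 
the index names are assumptions for illustration, not existing Library code:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in one Python integer

    def _positions(self, term: str):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def candidate_indexes(filters: dict, term: str) -> list:
    """Names of the indexes whose filter does not rule out `term`."""
    return [name for name, f in filters.items() if f.might_contain(term)]


filters = {"index-a": BloomFilter(), "index-b": BloomFilter()}
for term in ("freenet", "spider"):
    filters["index-a"].add(term)
filters["index-b"].add("debian")
# Only the indexes returned here need an actual (slow) search.
print(candidate_indexes(filters, "freenet"))
```

An index that does contain the term is always returned. An index that does not 
contain it is usually skipped, but may occasionally be searched anyway because 
of a false positive.
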
>> 
>> Thanks a lot for your patience,
>> 
>> leuchtkaefer
>
>

_______________________________________________
Devl mailing list
[email protected]
https://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
