Re: [freenet-dev] questions about Library for my GSoC project

Matthew Toseland Wed, 22 May 2013 03:46:28 -0700

On Wednesday 22 May 2013 09:30:03 leuchtkaefer wrote:
> 
> Hi Ximin,
> 
> I am not confusing the Library index with the low level block datastore. 
> But I still have doubts regarding Library :/  At first, I thought the Library 
> index will act as an inverted index. But this is contrary to Freenet 
> anonymity goals.
> 
> I read that the index contains the utab tree (utab: BTreeMap<URIKey, 
> BTreeMap<FreenetURI, URIEntry>>) and the freenet uri contains the routing key:
> freenet:[KeyType@]RoutingKey,CryptoKey[,n1=v1,n2=v2,...][/docname][/metastring]
> 
> Then it is not possible to share the index? 
> 
> Besides, only some users run the new spider so I guess only some users have 
> local indexes and publish some part on Freenet.


A lot of what follows explains things that you may already understand; please 
bear with me and ask for clarification on anything you don't follow. 
Thanks...

An index indexes only the content that has been put into it, not all data on 
Freenet. With the current code, Spider does the indexing, by following links 
from one page to another, like an internet search engine. We are simply 
building an index on top of a block store. It does not include everything on 
Freenet.

Currently Spider doesn't even use the Web of Trust, so announcing new freesites 
has to be done by other means, e.g. forums. Once some freesite links to them, 
Spider will pick them up eventually when it sees that that freesite has 
updated. We would like Spider to pick up site announcements from forums and/or 
from the Web of Trust.
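
To make the link-following concrete, here is a minimal Python sketch of a 
breadth-first crawl. It is not Spider's actual code: crawl, fetch_links and the 
links table are hypothetical stand-ins for fetching a page and extracting its 
links. The point is only that a site nobody links to is never reached.

```python
from collections import deque

def crawl(seed, fetch_links):
    """Visit every page reachable from seed by following links, breadth first."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        uri = queue.popleft()
        order.append(uri)  # a real spider would index the page here
        for target in fetch_links(uri):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical link graph: "d" links to "a", but nothing links to "d".
links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
visited = crawl("a", lambda uri: links.get(uri, []))
```

Here visited never includes "d", which is why new freesites have to be 
announced somewhere the crawl can see them.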

For filesharing, we probably want the index to only contain files people have 
added. For example, we could link an upload to our Web of Trust identity, so 
when the upload finishes, instead of manually posting it on a forum (or linking 
it from a freesite), we also automatically add it to our filesharing index, 
which is exposed through the Web of Trust, and can be searched (or merged) by 
anyone who has a high enough trust level for our WoT identity. Web of Trust 
itself is another high-level structure: one node can have many WoT identities 
or none; it's all built from keys.
> 
> Who has access to that on-Freenet index? a group of users (PSK) or is it 
> public for any Freenet user (guess no)? 

Indexes, like anything else on Freenet, are visible to anyone who has the key 
(in the form of a URL). The top level of an index is currently a USK.

Freenet itself provides only very limited functionality:
- Fetch a key (as a single file).
- Insert a single file to a key.
- Insert a bundle of keys as a "freesite".

There are 4 types of keys:
- CHK: Content hash key. URI depends on the (encrypted) content.
- SSK: Signed subspace key. Belongs to a cryptographic identity, so the URI 
consists of the hash of the public key, and a filename.
- USK: Updatable subspace key. A messy hack to provide an updatable key based 
on SSKs. URI consists of the hash of the public key, a filename, and a version 
number.
- KSK: Keyword signed key. SSK where the public/private keypair is derived from 
the keyword. URI consists of a keyword, in the form KSK@<keyword>.

There is more detail on these on the wiki.
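
As a rough illustration of the four key types, the following Python sketch 
splits a URI into its key type and the remainder. The function name split_key 
and the example URIs are made up for illustration; this is not Freenet's own 
URI parser, and it requires the KeyType@ prefix to be present.

```python
KEY_TYPES = {"CHK", "SSK", "USK", "KSK"}

def split_key(uri):
    """Return (key_type, rest) for a URI such as 'freenet:KSK@some-keyword'."""
    if uri.startswith("freenet:"):
        uri = uri[len("freenet:"):]
    key_type, sep, rest = uri.partition("@")
    # This sketch insists on an explicit, known key type.
    if not sep or key_type not in KEY_TYPES:
        raise ValueError("unrecognised key type in %r" % uri)
    return key_type, rest

# Placeholder URIs; real keys carry long encoded routing and crypto keys.
ksk = split_key("freenet:KSK@some-keyword")
usk = split_key("USK@routingkey,cryptokey/mysite/3")
```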

Everything else is built on top of this at a higher layer, including the Web of 
Trust, forums/microblogging (FMS, Sone, Freetalk), and searching.
> 
> I understand that a user can effectively be the owner of one part/branch of 
> the top-layer structure and update/modify/delete its part (COW). A top-layer 
> structure is the "overall vision" of one user composed by pieces published by 
> multiple users. A top-layer structure is always local (but some subtrees are 
> links to on Freenet structures).

At present, a user owns an index, and can add to that index. The top level USK 
includes CHKs pointing to the two trees (by term and by uri). The trees 
themselves are made of CHKs, so changing a node in the tree requires updating 
all the nodes above it. But the advantage is that you only need to update those 
nodes, rather than re-upload the whole tree.

The COW structure means that, in principle, a different user could add to that 
index in the same manner, creating their own new tree (with new CHK root(s)). 
This does not affect the first tree, but it shares most of its content with the 
first tree. And as above, it's relatively cheap: we only need to update the 
nodes that are changed and all their parents up the tree. The existing code in 
Library supports this in principle but there is no UI for it at present: Spider 
uses Library only to update its own index.
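
The copy-on-write behaviour can be sketched in a few lines of Python. This is a 
simplified two-level tree, not Library's actual B-tree code: the store dict 
stands in for the network, a SHA-256 of the serialized node stands in for a 
CHK, and put, get and insert are hypothetical helpers.

```python
import hashlib
import json

store = {}  # stand-in for Freenet: content key -> serialized node

def put(node):
    """Store a node under a key derived from its content, like a CHK."""
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def get(key):
    return json.loads(store[key])

def insert(root_key, path, term, value):
    """Return a new root key; only nodes along path are rewritten."""
    node = get(root_key)
    if not path:
        new_node = {"leaf": True, "entries": {**node["entries"], term: value}}
    else:
        children = dict(node["children"])
        children[path[0]] = insert(children[path[0]], path[1:], term, value)
        new_node = {"leaf": False, "children": children}
    return put(new_node)

leaf_a = put({"leaf": True, "entries": {"apple": ["uri1"]}})
leaf_m = put({"leaf": True, "entries": {"mango": ["uri2"]}})
root = put({"leaf": False, "children": {"a": leaf_a, "m": leaf_m}})

# A second user adds a term: one new leaf and one new root are created.
new_root = insert(root, ["a"], "avocado", ["uri3"])
```

After the insert the old root still describes the old tree, the "m" leaf is 
shared by both roots, and only two new nodes were stored.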

When searching, Library (the search box on the homepage) will happily search 
multiple indexes at once and return the merged results from all of them. That's 
the limit of the "local overall vision". Individual indexes are normally 
published so that other people can use them, since running Spider is fairly 
heavy.
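
A minimal Python sketch of searching several indexes and merging the results 
might look like this. How Library really merges and ranks is not described 
here, so the rule below (keep the highest relevance per URI, then sort) is an 
assumption, as are the names search, index_a and index_b.

```python
def search(indexes, term):
    """Look term up in every index and merge the hits into one ranked list."""
    best = {}
    for index in indexes:
        for uri, relevance in index.get(term, []):
            # Assumed merge rule: keep the best score seen for each URI.
            if relevance > best.get(uri, float("-inf")):
                best[uri] = relevance
    # Highest relevance first; ties broken by URI for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Two hypothetical indexes mapping term -> [(uri, relevance), ...]
index_a = {"freenet": [("uri1", 0.9), ("uri2", 0.4)]}
index_b = {"freenet": [("uri2", 0.7), ("uri3", 0.1)], "other": [("uri4", 0.5)]}
results = search([index_a, index_b], "freenet")
```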
> 
> I don't understand why data blocks were included in the index. Does that 
> mean the index contains another replica of the data? If that is the case, it 
> is necessary to replace b-trees with b+trees, as was previously suggested, to 
> remove data and reduce index size.

We do not include *data* in the index. What we do include is:
- A map from terms (that is, keywords) to the pages including those terms, 
ranked by relevance, and including the location of the word within the page 
(i.e. the number of words before that word).
- A map from URLs to basic metadata (title etc).
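
The two maps can be sketched in Python as follows. This is only an 
illustration, not Library's on-disk format: build_index and lookup are 
hypothetical names, and occurrence count is used as a stand-in for relevance, 
since the real ranking is not described here.

```python
from collections import defaultdict

def build_index(pages):
    """pages maps uri -> (title, text). Returns (term_map, uri_map)."""
    term_map = defaultdict(dict)  # term -> {uri: [word positions]}
    uri_map = {}                  # uri -> basic metadata
    for uri, (title, text) in pages.items():
        uri_map[uri] = {"title": title}
        # Position is the number of words before this word in the page.
        for position, word in enumerate(text.lower().split()):
            term_map[word].setdefault(uri, []).append(position)
    return dict(term_map), uri_map

def lookup(term_map, term):
    """Pages containing term, most occurrences first (assumed ranking)."""
    hits = term_map.get(term, {})
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))

pages = {
    "uri1": ("Intro", "freenet is a network freenet"),
    "uri2": ("Other", "a freenet page"),
}
term_map, uri_map = build_index(pages)
hits = lookup(term_map, "freenet")
```

Note that neither map holds the page content itself, only terms, positions and 
metadata.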
> 
> I am also thinking how to apply bloomfilters to the on-Freenet index. I 
> didn't check in detail what the current support for bloomfilters is inside 
> Freenet. Initially, I understood that bloomfilters are applied for one hop 
> file request, meaning that bloomfilters are shared with neighbors. 

No, at the moment the only use of Bloom filters is to optimise the datastore. 
This is not relevant to search. Search is at least two layers above the 
datastore.

The proposal is that an index should link to a Bloom filter of all the terms in 
that index. So then when we do a search over multiple indexes, we can check the 
Bloom filter first (which we will have pre-loaded, so this is instant), and 
identify that most of the indexes don't contain the term (word) we are looking 
for. Then we only need to search those that might contain that word.
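
A minimal Python sketch of that pre-check, under stated assumptions: the 
BloomFilter class, its size and hash count, and indexes_to_search are all 
illustrative, not existing Library or Freenet code. A Bloom filter can give 
false positives but never false negatives, so skipping an index whose filter 
says "no" is always safe.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions from one SHA-256 (double hashing).
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

def indexes_to_search(filters, term):
    """Names of the indexes whose filter says the term might be present."""
    return [name for name, f in filters.items() if f.might_contain(term)]

# Hypothetical indexes and their terms.
index_terms = {"index_a": ["freenet", "spider"], "index_b": ["library", "search"]}
filters = {}
for name, terms in index_terms.items():
    filters[name] = BloomFilter()
    for term in terms:
        filters[name].add(term)

candidates = indexes_to_search(filters, "freenet")
```

Only the indexes named in candidates then need an actual lookup over the 
network.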
> 
> Thanks a lot for your patience,
> 
> leuchtkaefer
