On Sun, Dec 31, 2000, Adam Langley wrote:
> On Sun, Dec 31, 2000 at 02:02:31AM -0800, Aaron Voisine wrote:
> > Here's my idea for keyword searches. I volunteer to implement it.
>
> That's very nice of you, and please don't be put off - but you
> utterly misunderstand much of Freenet.
>
> > Each node keeps an index of the KSKs it has in its data store. When it
> > receives a Request.Search
>
> KSKs are binary by the time they hit a node.
>

That is a big drawback :)

Clients have to insert keyword indices for keyword searches to work
(doh!).  The two methods mentioned in David Wentzlaff's (w/Likuo Lin &
Alexander Yip) searching paper at

http://cag.lcs.mit.edu/~wentzlaf/classes/6.899/project/public/doc/

are Enumeration and Summaries.  The paper also mentions Lightweight
Indirect Files (LIFs), but I think the Summaries method is a client-side
implementation of LIFs.  The problem the Summaries method tries to solve
is how to create an inverted index mapping keywords to keys.  The
solution presented is to create an inverted index daily, so that each
summary indexes only the keywords added since the previous summary was
created.  I think this is a reasonable property, since the most recent
files are often the most relevant and also have the highest probability
of being cached by Freenet.  Client-side indexing implementations of
searching have another desirable property: popular keywords will have
indices that are the most up to date and the most widely cached.
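To make the Summaries idea concrete, here is a rough Python sketch of building one daily inverted index (keyword -> keys), covering only documents inserted since the last summary.  This is purely illustrative; the document tuple shape (key, keywords, timestamp) is my own assumption, not anything from the paper or from Freenet itself:

```python
from collections import defaultdict

def build_summary(documents, last_summary_time):
    """Build an inverted index (keyword -> [(timestamp, key), ...])
    covering only documents newer than the last summary, per the
    Summaries method.  Document shape (hypothetical): a tuple of
    (key, keywords, timestamp)."""
    index = defaultdict(list)
    for key, keywords, timestamp in documents:
        if timestamp <= last_summary_time:
            continue  # already covered by an earlier summary
        for word in keywords:
            index[word].append((timestamp, key))
    # newest references first, matching the "most recent is most
    # relevant" property above
    for word in index:
        index[word].sort(reverse=True)
    return dict(index)
```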

I think I would combine the notions of Enumeration and Summaries into a
single index that is updated like an Enumeration but ages like a Summary.  
Let's say a document has the keywords mp3 and metallica.  Updating these
keywords would work like the Enumeration method: a binary search on the
keyword (probe mp3.0, mp3.100, mp3.200, ...).  For each collision on the
keyword, keep a list of unique keys sorted by a timestamp.  When an
unused Enumeration is found, insert the top N key references (sorted by
timestamp) along with the new one.  So every Enumeration will contain an
estimate of the N newest references:

mp3.209:
my_timestamp     my_keyref
older_timestamp  older_keyref
oldest_timestamp oldest_keyref
...

metallica.55:
my_timestamp     my_keyref
older_timestamp  older_keyref
oldest_timestamp oldest_keyref
...
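The slot-finding step above could be sketched roughly like this in Python.  Note this is hypothetical: `exists()` stands in for a Freenet KSK probe (no such client API is quoted here), and I assume used slots are contiguous from 0 upward:

```python
def first_free_slot(exists, keyword, step=100):
    """Find the first unused enumeration number for `keyword` by
    probing upward in `step`-sized jumps (mp3.0, mp3.100, mp3.200, ...)
    and then binary-searching the last occupied interval.
    `exists(name)` is a hypothetical stand-in for a KSK lookup;
    assumes used slots are contiguous starting at 0."""
    if not exists(f"{keyword}.0"):
        return 0
    hi = step
    while exists(f"{keyword}.{hi}"):
        hi += step
    lo = hi - step
    # invariant: exists(lo) is True, exists(hi) is False
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if exists(f"{keyword}.{mid}"):
            lo = mid
        else:
            hi = mid
    return hi
```

With slots mp3.0 through mp3.56 occupied, this probes mp3.0 and mp3.100 and then narrows in on mp3.57 as the first free slot.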

The timestamps have another property that may be useful.  If you search
for something like "metallica and mp3", the timestamps give you a
direction vector for finding keys that are indexed under both metallica
and mp3.  For example, if my indices contain:

mp3.209
timestamp=20001230 keyA
timestamp=20001227 keyM
timestamp=20001225 keyB

metallica.55
timestamp=20001227 keyM
timestamp=20001219 keyY
timestamp=20001125 keyZ

I can ascertain from the timestamp that keyA probably does not have a
matching entry in metallica.54 or metallica.53 (with some fuzziness
applied), but it may have an entry in metallica.56.  I think the
timestamps should be sufficiently coarse grained (day or even week
granularity) to prevent compromising anonymity when combined with
traffic analysis.
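The direction-vector idea might look something like this, again as a hypothetical Python sketch (the `fuzz` window and the entry format are my assumptions, matching the example indices above):

```python
def probe_direction(ts, other_entries, fuzz=2):
    """Given a key's timestamp from one index and the (timestamp, key)
    entries of one enumeration slot of the other keyword, guess where
    a matching entry would live: this slot ("here"), a lower-numbered
    (older) slot, or a higher-numbered (newer) slot.  `fuzz` widens
    the window to absorb the coarse-grained timestamps."""
    newest = max(t for t, _ in other_entries)
    oldest = min(t for t, _ in other_entries)
    if ts > newest + fuzz:
        return "higher"  # e.g. keyA likely sits in metallica.56+
    if ts < oldest - fuzz:
        return "lower"   # look in metallica.54 or earlier
    return "here"
```

Running it against the metallica.55 entries above, keyA's timestamp (20001230) points at a higher slot, while keyM's (20001227) falls inside the slot itself.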

If I were writing a search client (and I may try to simulate such a
thing), I would dedicate some disk space to caching keyword inverted
indices, since a large number of keywords fit into a pretty small space,
and reusing the same keywords across different searches would then have
a tremendous impact on speed.
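Such a client-side cache could be as simple as this sketch (the `fetch` callable is a hypothetical hook for actually retrieving an index over Freenet; a real client would also sanitize keywords before using them as filenames):

```python
import os
import pickle

class IndexCache:
    """Tiny disk cache for keyword inverted indices, so repeated
    searches on the same keywords skip the network.  `fetch(keyword)`
    is a hypothetical callable that retrieves an index remotely."""

    def __init__(self, directory, fetch):
        self.directory = directory
        self.fetch = fetch
        os.makedirs(directory, exist_ok=True)

    def _path(self, keyword):
        # sketch only: a real client should sanitize the keyword
        return os.path.join(self.directory, keyword + ".idx")

    def get(self, keyword):
        path = self._path(keyword)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # cache hit: no network traffic
        index = self.fetch(keyword)
        with open(path, "wb") as f:
            pickle.dump(index, f)
        return index
```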

_______________________________________________
Freenet-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/mailman/listinfo/freenet-dev
