Re: [freenet-devl] Freenet - The Next Generation : Fuzzy Searching

coderman Sat, 02 Jun 2001 23:04:32 -0700
Ian Clarke wrote:
> 
> Currently a key in Freenet consists, essentially, of a string of text.
> We define a "closeness" function between two keys, so that given three
> keys, A, B, and C, we can determine which of A or B is closer to C.  To
> do this we have chosen a lexicographic comparison, so that "aardvark" is
> closer to "apple" than "zebra" is.  It was apparent, even when writing
> my original paper describing the Freenet architecture, that much more
> complex keys could be used with correspondingly more sophisticated
> closeness operations.
> 
> So how does this help us with fuzzy searching?  Well, consider that we
> defined a new key type, called a MetadataKey, which, rather than just a
> single string of text, consisted of a number of key-data pairs, such as:
> 
> "artist" => "tori amos"
> "album"  => "little earthquakes"
> "song"   => "winter"
> "year"   => "1988"
> 
> Now, let's say that we wanted to search for an mp3 which was stored under
> such a key.  We could define a search like this:
> 
> ("artist" string= "tori amos") AND
> ("album" contains "litt") AND
> ("song" contains "w") AND
> ("year" lessThan "2000")
> 
> A node receives this search, and compares it to each of the MetadataKeys
> in its datastore.  It uses fuzzy logic to come up with a value between 0
> and 1 for how closely each MetadataKey matches this search (I have
> thought this out in more detail but, for brevity, won't explain it here;
> do a web search for "fuzzy logic" and "Levenshtein distance"), a
> perfect match being 1.

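To make the quoted scoring idea concrete, here is a minimal sketch of how a node might score one MetadataKey against the example search. Everything in it is an assumption for illustration, not part of the proposal: the function names, the use of normalized Levenshtein distance for string clauses, the sliding-window treatment of "contains", and the choice of min() as the fuzzy AND.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def score_clause(op: str, wanted: str, actual) -> float:
    """Score one (field, operator, value) clause; the operators are assumed."""
    if actual is None:          # the key has no such field
        return 0.0
    if op == "string=":
        return similarity(wanted, actual)
    if op == "contains":
        if wanted in actual:
            return 1.0
        # Otherwise take the best similarity over same-length windows.
        n = len(wanted)
        windows = [actual[i:i + n] for i in range(max(1, len(actual) - n + 1))]
        return max(similarity(wanted, w) for w in windows)
    if op == "lessThan":
        # Assumes both sides are decimal integers; crisp 0/1 result.
        return 1.0 if int(actual) < int(wanted) else 0.0
    raise ValueError("unknown operator: " + op)


def score(search, key) -> float:
    """Fuzzy AND of all clauses, taken here as the minimum clause score."""
    return min(score_clause(op, wanted, key.get(field))
               for field, op, wanted in search)


key = {"artist": "tori amos", "album": "little earthquakes",
       "song": "winter", "year": "1988"}
search = [("artist", "string=", "tori amos"),
          ("album", "contains", "litt"),
          ("song", "contains", "w"),
          ("year", "lessThan", "2000")]
exact = score(search, key)                          # perfect match: 1.0
near = score(search, dict(key, artist="tori amoz"))  # slightly below 1.0
```

Note that this only works if the node can read the field values as plain text, which is the point the questions below are getting at.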

You will have to forgive my lack of key knowledge, but I am curious about
the kind of comparisons that could be supported by meta data keys.

In the aardvark example, I am led to believe that the keys are hash values,
and that lexicographic distance is determined by the numeric distance between
the two.  Meaning, roughly, that a key with all 'aaaaa's would have a
numerically smaller value than a key with all 'zzzzz's (or at least a smaller
distance from a base key value, such as 'a' or something like that).
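
To illustrate the "numeric distance" reading of the aardvark example, here is a minimal sketch. It assumes keys are plain lowercase a-z strings and maps each one to a number in [0, 1) that preserves dictionary order; key_to_fraction and closer are hypothetical names, and this is not a claim about how Freenet actually compares keys.

```python
def key_to_fraction(key: str, width: int = 8) -> float:
    """Read the first `width` letters as base-27 digits after the point.

    Assumes the key contains only the letters a-z (case is ignored).
    """
    value = 0.0
    scale = 1.0
    for ch in key[:width].lower():
        scale /= 27
        value += (ord(ch) - ord("a") + 1) * scale
    return value


def closer(a: str, b: str, target: str) -> str:
    """Return whichever of a or b is numerically closer to target."""
    t = key_to_fraction(target)
    da = abs(key_to_fraction(a) - t)
    db = abs(key_to_fraction(b) - t)
    return a if da <= db else b


winner = closer("aardvark", "zebra", "apple")  # "aardvark"
```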

I am also under the impression that these examples assume that a hashing
function is applied to these plain text keys to produce a value that cannot
be reversed to determine the original text (at least, not easily; a one-way
hash?).

Now, you mention that meta data keys will contain multiple key/value pairs.
Does this imply that these keys will be variable length, and that any
node can read this key and determine exactly how many key/value pairs it
contains?  And also, is each key the result of a hash operation?

You mention later that meta data keys are not encrypted.  Are they plaintext,
or still the product of a hash function of some type?

I ask this because it leads into my next question.  There are a few operators
in your example: equality, containment, and less-than/greater-than.  I can see
how relational operations, such as equal to or less than/greater than, would
be easy to implement.  But what about substrings?  How would you determine
that a given key contains a piece of text (or data of any kind) within it?
Assuming this is a hash key of some type, I am not sure that this is possible.
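
A small sketch of the concern, using SHA-1 from Python's hashlib purely as a stand-in (the thread does not say which hash function, if any, would be applied to these values): equality survives hashing, but a substring test against the digest tells you nothing about the original text.

```python
import hashlib


def hashed(text: str) -> str:
    """Hex digest of the text; SHA-1 is only an assumed example hash."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


album = "little earthquakes"
fragment = "litt"

# On plain text, containment is trivial.
plain_contains = fragment in album

# On hashed values, equality still works: same input, same digest.
equality_on_hashes = hashed(album) == hashed("little earthquakes")

# But the digest of the fragment is unrelated to the digest of the
# full string, so a "contains" test on hashes cannot succeed.
hash_contains = hashed(fragment) in hashed(album)
```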

And finally, how will different kinds of meta data keys be identified?  Is
there a general 'type' of meta data key, let's say a 'Music MetaData' key type,
so that when comparing subsequent key/value pairs the correct assumptions are
used?
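
One way to picture what I am asking about is sketched below. The SCHEMAS table, the MetadataKey class, and its fields are all hypothetical; the idea is only that a declared key type tells a node how to interpret each value before comparing it.

```python
from dataclasses import dataclass, field

# Hypothetical per-type schemas: field name -> how to interpret the value.
SCHEMAS = {
    "Music MetaData": {"artist": str, "album": str, "song": str, "year": int},
}


@dataclass
class MetadataKey:
    key_type: str                        # e.g. "Music MetaData"
    pairs: dict = field(default_factory=dict)

    def typed_value(self, name: str):
        """Convert the stored text using the schema for this key type."""
        convert = SCHEMAS[self.key_type][name]
        return convert(self.pairs[name])


key = MetadataKey("Music MetaData", {"artist": "tori amos", "year": "1988"})
year_is_before_2000 = key.typed_value("year") < 2000  # numeric comparison
```

Without the type, a node would not know whether "1988" lessThan "2000" should be compared as numbers or as strings.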


> The search request is then forwarded to the
> reference associated with the closest match.  Once the HTL runs out, a
> SearchResponse message is sent which contains, at any given time, the
> top, say, 10 matches for the query, along with the CHK of the data they
> refer to. Each node updates this as they pass the search request back to
> the requester.  The requester can then choose which of these they wish
> to request using conventional Freenet messages.
>
> Of course, we do lose some nice features which normal Freenet messages
> have here - since node operators can now read the metadata in their
> datastores, it is no longer encrypted.  We also lose the randomization
> effect that we get from hashing keys.  However, this is only search
> information, which is much smaller than most data and can therefore be
> distributed more widely, so node-operator censorship matters less; I
> don't think that this is a serious problem.  Also, this
> mechanism will augment the existing Freenet, not replace anything (the
> datastore for the MetadataKeys can be kept separate).
> 
> Implementation of this will dramatically reduce the functional
> difference between Freenet and something like Google, Napster, or
> Gnutella, while retaining Freenet's scalability.

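As I read the SearchResponse step in the quoted text, each node on the return path merges its own matches into the running top-10 list. A minimal sketch of that merge follows; merge_top_matches, the (score, CHK) tuple layout, and the placeholder CHK strings are my assumptions, not anything specified in the proposal.

```python
import heapq


def merge_top_matches(incoming, local, limit=10):
    """Merge two lists of (score, chk) pairs and keep the best `limit`.

    If the same CHK appears more than once, only its highest score is kept.
    """
    best = {}
    for match_score, chk in list(incoming) + list(local):
        if chk not in best or match_score > best[chk]:
            best[chk] = match_score
    return heapq.nlargest(limit, ((s, c) for c, s in best.items()))


# Placeholder CHK strings, for illustration only.
incoming = [(0.90, "CHK@a"), (0.50, "CHK@b")]
local = [(0.70, "CHK@b"), (0.95, "CHK@c")]
merged = merge_top_matches(incoming, local, limit=2)
# merged == [(0.95, "CHK@c"), (0.90, "CHK@a")]
```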

I guess I am still struggling to understand the details of the meta data
keys.  Are these plain text?  And if so, would it be preferable to use some
kind of quasi-standard meta data format (perhaps a lite RDF)?


Best regards,
    Martin Peck.
