RE: [freenet-dev] Search Indexing Round 2

Scott Young Sun, 17 Aug 2003 15:58:58 -0700

On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > I think it would be better if the client application calculated the
> > scoring. I assume that the 'weight' you mention here is somehow
> > calculated from where in the page the word was found and how 'valuable'
> > in the given page the word is and so on..
> 
> The "weight" could be calculated by any means the index-publisher wants,
> but it should generally indicate the relevance of a certain page to a
> certain keyword.
> 
> 
> > I would recommend that the index file contained this information instead
> > <Information Domain> <Relative position> <KEY>
> 
> So basically you're saying more metadata should be stored on these index
> pages so that better queries can be done.  I can see two ways that we
> can handle orthogonal metadata: include it with the data in a particular
> index, or include it in its own separate index that uses the mechanism
> above.  For example, if you want to have a song search engine, you could
> have an index for the name of the song, and another index for the
> artists.  Orthogonal metadata like genre and bitrate could be stored
> along with the entries instead of in their own indexes.  If they were
> stored in their own indexes, then the page for "128 kbps" would be
> unacceptably HUGE.
> 
> This is starting to look like a database.  Databases need less storage
> space if they are normalized.  With multiple indexes, pages could be
> stored like this:
> 
> 
> [EMAIL PROTECTED]/mySearch/keys/keys1


Sorry... my message got loose before it was done (Evolution is acting
up).


As I was saying, you could have a listing of every page that your search
system indexes, split across several files.  For example:

[EMAIL PROTECTED]/mySearch/keys/keys1
would contain
1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
...


The first number is just an index number.  It is used so that any other
index only needs to store that index number and its weight, which saves
space when a certain file is indexed in multiple places.  Other data
that has a 1-to-1 correspondence with the key can also be put here.
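
To make the layout concrete, here is a rough Python sketch of how a
client might read one of these "keys" pages.  The field order (index
number, key, genre, artist) just follows my example above; the names
KeyEntry and parse_keys_page are made up for illustration, and nothing
like this exists in Freenet today.

```python
import shlex
from typing import Dict, NamedTuple


class KeyEntry(NamedTuple):
    """One line of a "keys" page (field names are hypothetical)."""
    index: int
    key: str
    genre: str
    artist: str


def parse_keys_page(text: str) -> Dict[int, KeyEntry]:
    """Parse lines of the form: <index> "<key>" "<genre>" "<artist>"."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and the "..." placeholder
        # shlex.split handles the double-quoted fields, spaces included
        index, key, genre, artist = shlex.split(line)
        entries[int(index)] = KeyEntry(int(index), key, genre, artist)
    return entries


SAMPLE_KEYS_PAGE = '''
1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
...
'''

entries = parse_keys_page(SAMPLE_KEYS_PAGE)
```

Here entries[2].artist would be "The Beatles", so any other index only
has to mention the number 2.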

The "keys1" page would contain entries 1 through 100, "keys2" would
contain 101 through 200, etc.
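
Finding the right page for an entry is then simple arithmetic.  A
minimal sketch, assuming 100 entries per page and index numbers that
start at 1 (keys_page_for is a made-up name):

```python
ENTRIES_PER_PAGE = 100  # assumed page size, as in the example above


def keys_page_for(index: int) -> str:
    """Return the name of the "keys" page that holds the given entry."""
    if index < 1:
        raise ValueError("index numbers start at 1")
    # entries 1..100 -> keys1, 101..200 -> keys2, and so on
    return f"keys{(index - 1) // ENTRIES_PER_PAGE + 1}"
```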

An index like this:
[EMAIL PROTECTED]/mySearch/indexes/artist
would contain pages that contain information about specific artists.  

For example:
[EMAIL PROTECTED]/mySearch/indexes/artist/Beatles
could list:
2 10
15 10
30 10
34234 10
545 10
where the first number refers to the index number of a song in the
"keys" directory, and the second number is the weight.  When querying
the Beatles page, the search engine would then request pages keys1,
keys6, and keys343 (which contain the keys for the songs listed in this
particular index).
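
As a sketch of that step, assuming the "index number, weight" line
format above and 100 entries per keys page, a client could work out
which pages to request like this (parse_index_page and pages_to_fetch
are hypothetical names):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ENTRIES_PER_PAGE = 100  # assumed page size


def parse_index_page(text: str) -> List[Tuple[int, int]]:
    """Parse "<index number> <weight>" lines into (index, weight) pairs."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        pairs.append((int(parts[0]), int(parts[1])))
    return pairs


def pages_to_fetch(pairs: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group index numbers by the number of the keys page that holds them."""
    pages = defaultdict(list)
    for index, _weight in pairs:
        pages[(index - 1) // ENTRIES_PER_PAGE + 1].append(index)
    return dict(sorted(pages.items()))


BEATLES_PAGE = """2 10
15 10
30 10
34234 10
545 10"""

needed = pages_to_fetch(parse_index_page(BEATLES_PAGE))
```

Grouping first means each keys page is requested only once, which
matters when every request is as slow as it is on Freenet.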


There could also be an index for the song title.  A user could say "Give
me every Beatles song named Help that is at least 160 kbps."  The search
engine could then come up with a query plan.  The query plan would
probably be to search the artist index for "Beatles," then look in the
Song Title index for "Help," take the intersection, look up the
resulting key index numbers in the key pages, and then filter the
results on the bitrate.
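
Here is a rough sketch of that plan in Python.  Everything is held in
plain dictionaries instead of being fetched from Freenet, the "kbps"
field is an assumed piece of per-key metadata, and ranking by the sum
of the two weights is only one possible choice.  run_query and the
sample data are invented for illustration.

```python
from typing import Dict, List, Tuple

IndexPage = List[Tuple[int, int]]  # (index number, weight) pairs


def run_query(artist_index: Dict[str, IndexPage],
              title_index: Dict[str, IndexPage],
              key_entries: Dict[int, dict],
              artist: str, title: str, min_kbps: int) -> List[str]:
    """Intersect two index pages, look up the keys, filter on bitrate."""
    artist_hits = dict(artist_index.get(artist, []))
    title_hits = dict(title_index.get(title, []))
    # Step 1: intersection of the index numbers from both indexes
    common = artist_hits.keys() & title_hits.keys()
    results = []
    for index in common:
        # Step 2: look the index number up in the (already fetched) key pages
        entry = key_entries[index]
        # Step 3: filter on metadata stored alongside the key
        if entry["kbps"] >= min_kbps:
            score = artist_hits[index] + title_hits[index]  # assumed ranking
            results.append((score, entry["key"]))
    results.sort(key=lambda r: (-r[0], r[1]))  # best score first
    return [key for _score, key in results]


# Made-up sample data
artist_index = {"Beatles": [(2, 10), (15, 10)]}
title_index = {"Help": [(2, 10), (77, 5)]}
key_entries = {
    2: {"key": "[EMAIL PROTECTED]/Help.mp3", "kbps": 192},
    15: {"key": "[EMAIL PROTECTED]/Yesterday.mp3", "kbps": 128},
    77: {"key": "[EMAIL PROTECTED]/HelpMe.mp3", "kbps": 256},
}

hits = run_query(artist_index, title_index, key_entries,
                 "Beatles", "Help", 160)
```

Only entry 2 appears in both indexes, and it passes the 160 kbps
filter, so hits contains just that one key.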



I guess what I'm getting at is more than just a search capability.  It
could actually work as a database on Freenet, albeit a high-latency
one.  Searching would be the immediately obvious application of a
database system, but there might be other later uses.

So the big question is, who wants to write an RDBMS over Freenet so
Freenet can get some really good searching capability?





_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
