On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > I think it would be better if the client application calculated the
> > scoring. I assume that the 'weight' you mention here is somehow
> > calculated from where in the page the word was found and how 'valuable'
> > in the given page the word is and so on..
>
> The "weight" could be calculated by any means the index-publisher wants,
> but it should generally indicate the relevance of a certain page to a
> certain keyword.
>
> > I would recommend that the index file contained this information instead
> > <Information Domain> <Relative position> <KEY>
>
> So basically you're saying more metadata should be stored on these index
> pages so that better queries can be done. I can see two ways that we
> can handle orthogonal metadata: include it with the data in a particular
> index, or include it in its own separate index that uses the mechanism
> above. For example, if you want to have a song search engine, you could
> have an index for the name of the song, and another index for the
> artists. Orthogonal metadata like genre and bitrate could be stored
> along with the entries instead of in their own indexes. If they were
> stored in their own indexes, then the page for "128 kbps" would be
> unacceptably HUGE.
>
> This is starting to look like a database. Databases need less storage
> space if they are normalized. With multiple indexes, pages could be
> stored like this:
>
> [EMAIL PROTECTED]/mySearch/keys/keys1

Sorry... my message got loose before it was done (Evolution is acting up). As I was saying, you could have a listing of every page that your search system indexes, split across several files. For example,

[EMAIL PROTECTED]/mySearch/keys/keys1

would contain

1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
...

The first number is just an index number. It is used so that any other index only needs to store that index number and its weight (which saves space when a certain file is indexed in multiple places). Other data that is in 1-to-1 correspondence with the key can also be put here. The "keys1" page would contain entries 1 through 100, "keys2" would contain 101 through 200, etc.

An index like this:

[EMAIL PROTECTED]/mySearch/indexes/artist

would contain pages that contain information about specific artists. For example:

[EMAIL PROTECTED]/mySearch/indexes/artist/Beatles

could list:

2 10
15 10
30 10
34234 10
545 10

where the first number refers to the index number of a song in the "keys" directory, and the second number is the weight. When querying the Beatles page, the search engine would then request key pages 1, 6, and 343 (which contain the keys for the songs listed in this particular index).

There could also be an index for the song title. A user could say "Give me every Beatles song named Help that is at least 160 kbps." The search engine could then come up with a query plan. The query plan would probably be to search the artist index for "Beatles," then look in the song title index for "Help," take the intersection, take the resulting key index numbers and look them up in the key pages, and then filter the results on the bitrate.

I guess what I'm getting at is more than just a search capability. It could actually work as a database on Freenet, albeit a high-latency one.
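
To make the query plan above a bit more concrete, here is a rough Python sketch of it. It is only an illustration under several assumptions of mine: the pages are plain in-memory dictionaries rather than anything fetched from Freenet, the "kbps" field and the title index contents are made up, the names (key_page_number, query, KEY_PAGES and so on) are hypothetical, and summing the two weights to rank results is just one possible choice. The only parts taken from the scheme itself are the 100-entries-per-page layout and the (index number, weight) pairs.

```python
KEYS_PER_PAGE = 100  # from the scheme: keys1 holds 1-100, keys2 holds 101-200


def key_page_number(index_number):
    """Return N such that page "keysN" holds the given index number."""
    return (index_number - 1) // KEYS_PER_PAGE + 1


# Hypothetical stand-in for fetched key pages: page number -> entries.
# The "kbps" field is an assumption; the sample entries only show genre/artist.
KEY_PAGES = {
    1: {
        1: {"uri": "[EMAIL PROTECTED]/ItDontMeanAThing.mp3",
            "genre": "Jazz", "artist": "Ella Fitzgerald", "kbps": 128},
        2: {"uri": "[EMAIL PROTECTED]/Help.mp3",
            "genre": "Rock", "artist": "The Beatles", "kbps": 192},
    },
}

# Index pages map an index number to a weight.
ARTIST_INDEX = {"Beatles": {2: 10, 15: 10, 30: 10, 34234: 10, 545: 10}}
TITLE_INDEX = {"Help": {2: 8, 77: 3}}  # made-up contents for illustration


def query(artist, title, min_kbps):
    """Intersect two indexes, look up the keys, then filter on bitrate."""
    artist_hits = ARTIST_INDEX.get(artist, {})
    title_hits = TITLE_INDEX.get(title, {})

    # Step 1: intersection of the index numbers found in both indexes.
    common = artist_hits.keys() & title_hits.keys()

    ranked = []
    for number in common:
        # Step 2: find the key page that holds this index number.
        page = KEY_PAGES.get(key_page_number(number), {})
        record = page.get(number)
        if record is None:
            continue  # page or entry not available; skip it

        # Step 3: filter on metadata stored alongside the key.
        if record["kbps"] >= min_kbps:
            # Assumption: rank by the sum of the two weights.
            ranked.append((artist_hits[number] + title_hits[number],
                           record["uri"]))

    ranked.sort(reverse=True)
    return [uri for _, uri in ranked]


if __name__ == "__main__":
    print(query("Beatles", "Help", 160))
```

Doing the intersection before touching any key page matters here because every key page is a separate (slow) fetch; only the pages for index numbers that survive the intersection need to be requested.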
Searching would be the immediately obvious application of a database system, but there might be other later uses. So the big question is, who wants to write an RDBMS over Freenet so Freenet can get some really good searching capability?

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
