----- Original Message -----
From: "Niklas Bergh" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 18, 2003 7:33 AM
Subject: Re: [freenet-dev] Search Indexing Round 2

> > > On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > > > I think it would be better if the client application calculated the
> > > > scoring. I assume that the 'weight' you mention here is somehow
> > > > calculated from where in the page the word was found and how 'valuable'
> > > > in the given page the word is and so on..
> > >
> > > The "weight" could be calculated by any means the index-publisher wants,
> > > but it should generally indicate the relevance of a certain page to a
> > > certain keyword.
> > >
> > > > I would recommend that the index file contained this information instead
> > > > <Information Domain> <Relative position> <KEY>
> > >
> > > So basically you're saying more metadata should be stored on these index
> > > pages so that better queries can be done.
> > Yes, but the index pages should only contain metadata that can be considered
> > as belonging to the word (as in the example above)

OK, I withdraw the above. Information domains could be defined elsewhere and
referred to by the <information domain> field. Now the obvious question
becomes: should the words in every information domain be fully indexed :)? I
am not so sure that this is a good idea.. I still think that all the 'words',
from every domain, should be located in a common place (like the files named
'index' files in previous mails).

> > > I can see two ways that we
> > > can handle orthogonal metadata: include it with the data in a particular
> > > index, or include it in its own separate index that uses the mechanism
> > > above. For example, if you want to have a song search engine, you could
> > > have an index for the name of the song, and another index for the
> > > artists. Orthogonal metadata like Genre and bitrate could be stored
> > > along with the entries instead of in their own indexes. If they were
> > > stored in their own indexes, then the page for "128 kbps" would be
> > > unacceptably HUGE.
> > >
> > > This is starting to look like a database.
> > > Databases need less storage
> > > space if they are normalized. With multiple indexes, pages could be
> > > stored like this:
> > >
> > > [EMAIL PROTECTED]/mySearch/keys/keys1
> >
> > Sorry... my message got loose before it was done (Evolution is acting
> > up).
> >
> > as I was saying, you could have a listing of every page that your search
> > system indexes, split across several files. For example
> >
> > [EMAIL PROTECTED]/mySearch/keys/keys1
> > would contain
> > 1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
> > 2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
> > ...
>
> (Primitive) partitioning indeed, your ideas are starting to catch up with
> newer RDBMS functionalities.
>
> > The first number is just an index number. It is used so that any other
> > indexes only need to store that index number and its weight (which means
> > saved space when a certain file is indexed in multiple places.) Other
> > data that is on a 1-to-1 correspondence with the key can also be put
> > here.
> >
> > The "keys1" page would contain entries 1 through 100, "keys2" would
> > contain 101 through 200, etc.
> >
> > An index like this:
> > [EMAIL PROTECTED]/mySearch/indexes/artist
> > would contain pages that contain information about specific artists.
> >
> > For example:
> > [EMAIL PROTECTED]/mySearch/indexes/artist/Beatles
> > could list:
> > 2 10
> > 15 10
> > 30 10
> > 34234 10
> > 545 10
> > where the first number refers to the index number of a song in the
> > "keys" directory, and the second number is the weight. When querying
> > the Beatles page, the search engine would then request pages 1, 6, and
> > 343 (which contain the keys for the songs listed in this particular
> > index).
> >
> > There could also be an index for the song title. A user could say "Give
> > me every Beatles song named Help that is at least 160 kbps." The search
> > engine could then come up with a query plan.
> > The query plan would
> > probably be to search the artist index for "Beatles," then look in the
> > Song Title index for "Help," take the intersection, take the resulting
> > key index numbers and look them up in the key pages, and then filter
> > the results on the bitrate.
> >
> > I guess what I'm getting at is more than just a search capability. It
> > could actually work as a database on Freenet, albeit a high-latency
> > one. Searching would be the immediately obvious application of a
> > database system, but there might be other later uses.
>
> Actual content indexing *isn't* an obvious application of an RDBMS; however,
> the storage structure of a generated index (index as in content index, not
> as in RDBMS index) fits very well into an RDBMS.
>
> > So the big question is, who wants to write an RDBMS over Freenet so
> > Freenet can get some really good searching capability?
>
> Let me just tie this in with the searching discussion.
> What we need of an RDBMS to implement the search system that we want is:
> 1. A word-index (originally located at [EMAIL PROTECTED]/mySearch/<word>).
>    Used for content indexing, i.e. locating which resources contain a
>    certain word.
> 2. Resource metadata (possibly partitioned into multiple
>    [EMAIL PROTECTED]/mySearch/keys/keysX). Contains metadata for the
>    different resources referred to anywhere in the 'kit' (examples could
>    be file name, size, bitrate and so on).
> 3. A number of alternative indexes (like for instance
>    [EMAIL PROTECTED]/mySearch/indexes/artist). To enable access to items
>    by different metadata.
> 4. A metadata file describing which of the indexes and keysX files (and
>    possibly words) are present in the 'db'.
>
> I have a feeling every file except the 'Resource' files would compress very
> well.. Intelligent use of containers would probably be good..
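
To make the query plan discussed above a bit more concrete, here is a rough Python sketch of it: intersect two index pages, work out which keysN pages are needed, fetch each once, then filter on bitrate. Only the 100-entries-per-page layout and the "key number, weight" index format come from the thread. The names (Resource, page_for, run_query), the placeholder key strings, the bitrate field and the choice to rank by summed weight are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 100  # from the thread: keys1 holds entries 1-100, keys2 holds 101-200


@dataclass
class Resource:
    key: str      # placeholder string standing in for a Freenet key (hypothetical)
    genre: str
    artist: str
    bitrate: int  # kbps; assumed to be stored alongside the key


def page_for(number: int) -> int:
    """Return N such that entry `number` lives on page keysN (1-based)."""
    return (number - 1) // PAGE_SIZE + 1


def run_query(artist_page, title_page, fetch_keys_page, min_bitrate):
    """artist_page / title_page: dict of key number -> weight for one index page.
    fetch_keys_page: callable(page number) -> dict of key number -> Resource.
    """
    # Step 1: intersect the two index pages on key number.
    common = artist_page.keys() & title_page.keys()
    # Step 2: fetch each needed keys page only once.
    needed = sorted({page_for(n) for n in common})
    pages = {p: fetch_keys_page(p) for p in needed}
    # Step 3: resolve key numbers and filter on bitrate.
    hits = []
    for n in common:
        res = pages[page_for(n)][n]
        if res.bitrate >= min_bitrate:
            # Summing the two weights is an assumed ranking rule.
            hits.append((artist_page[n] + title_page[n], n, res))
    # Highest combined weight first; key number breaks ties deterministically.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [res for _, _, res in hits]


# Tiny in-memory stand-in for the keys pages and two index pages.
keys_pages = {
    1: {2: Resource("KEY-A/Help.mp3", "Rock", "The Beatles", 192),
        15: Resource("KEY-B/Help.mp3", "Rock", "The Beatles", 128)},
    6: {545: Resource("KEY-C/Yesterday.mp3", "Rock", "The Beatles", 192)},
}
artist_beatles = {2: 10, 15: 10, 545: 10}
title_help = {2: 10, 15: 8}

results = run_query(artist_beatles, title_help, keys_pages.__getitem__, 160)
```

Doing the intersection before touching the keys pages matters here because every page fetch is a high-latency Freenet request; in this example only keys1 is fetched, and keys6 is never requested.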
>
> /N
>
> _______________________________________________
> devl mailing list
> [EMAIL PROTECTED]
> http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
