On Sun, Aug 17, 2003 at 07:03:24PM -0400, Scott Young wrote:
> On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > > I think it would be better if the client application calculated the
> > > scoring. I assume that the 'weight' you mention here is somehow
> > > calculated from where in the page the word was found and how 'valuable'
> > > in the given page the word is and so on..
> > 
> > The "weight" could be calculated by any means the index-publisher wants,
> > but it should generally indicate the relevance of a certain page to a
> > certain keyword.
> > 
> > 
> > > I would recommend that the index file contained this information instead
> > > <Information Domain> <Relative position> <KEY>
> > 
> > So basically you're saying more metadata should be stored on these index
> > pages so that better queries can be done.  I can see two ways that we
> > can handle orthogonal metadata: include it with the data in a particular
> > index, or include it in its own separate index that uses the mechanism
> > above.  For example, if you want to have a song search engine, you could
> > have an index for the name of the song, and another index for the
> > artists.  Orthogonal metadata like genre and bitrate could be stored
> > along with the entries instead of in their own indexes.  If they were 
> > stored in their own indexes, then the page for "128 kbps" would be
> > unacceptably HUGE.
> > 
> > This is starting to look like a database.  Databases need less storage
> > space if they are normalized.  With multiple indexes, pages could be
> > stored like this:
> > 
> > 
> > [EMAIL PROTECTED]/mySearch/keys/keys1
> 
> Sorry... my message got loose before it was done (Evolution is acting
> up).
> 
> 
> As I was saying, you could have a listing of every page that your search
> system indexes, split across several files.  For example:
> 
> [EMAIL PROTECTED]/mySearch/keys/keys1
> would contain
> 1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
> 2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
> ...
> 
> 
> The first number is just an index number.  It is used so that any other
> index only needs to store that index number and its weight (which saves
> space when a file is indexed in multiple places).  Other data that has a
> 1-to-1 correspondence with the key can also be put here.
> 
> The "keys1" page would contain entries 1 through 100, "keys2" would
> contain 101 through 200, etc.
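
The page layout described above implies a simple mapping from an index number to the page that holds it. A minimal sketch, assuming 100 entries per page and 1-based numbering as stated; the function name `keys_page_for` is hypothetical and not part of any Freenet API:

```python
ENTRIES_PER_PAGE = 100  # from the description: "keys1" holds entries 1 through 100


def keys_page_for(index_number: int) -> str:
    """Return the name of the keys page that holds a 1-based index number."""
    if index_number < 1:
        raise ValueError("index numbers start at 1")
    # Entries 1..100 map to page 1, 101..200 to page 2, and so on.
    page = (index_number - 1) // ENTRIES_PER_PAGE + 1
    return f"keys{page}"
```

A client only needs the index number to know which page to request, so no separate directory of pages has to be published.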
> 
> An index like this:
> [EMAIL PROTECTED]/mySearch/indexes/artist
> would contain pages that contain information about specific artists.  
> 
> For example:
> [EMAIL PROTECTED]/mySearch/indexes/artist/Beatles
> could list:
> 2 10
> 15 10
> 30 10
> 34234 10
> 545 10
> where the first number refers to the index number of a song in the
> "keys" directory, and the second number is the weight.  When querying
> the Beatles page, the search engine would then request key pages 1, 6,
> and 343 (which, at 100 entries per page, contain the keys for the songs
> listed in this particular index).
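
To make the lookup step concrete, here is a rough sketch of how a client might read an index page like the one above and work out which key pages to fetch. It assumes the "index number, weight" line format shown and the 100-entries-per-page layout with 1-based numbering; the function names and the `BEATLES_PAGE` constant are hypothetical:

```python
def parse_index_page(text):
    """Parse lines of '<index number> <weight>' into a list of (int, int) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines or lines that do not have two fields
        entries.append((int(parts[0]), int(parts[1])))
    return entries


def key_pages_needed(entries, entries_per_page=100):
    """Return the sorted, de-duplicated key page numbers for the given entries."""
    return sorted({(index - 1) // entries_per_page + 1 for index, _ in entries})


# The example listing from the message above.
BEATLES_PAGE = """2 10
15 10
30 10
34234 10
545 10"""

entries = parse_index_page(BEATLES_PAGE)
pages = key_pages_needed(entries)
```

Because several entries can fall on the same key page, de-duplicating the page numbers keeps the number of high-latency requests down.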

I wouldn't. Compress it if you like, but Freenet is lossy, remember?
> 
> 
> There could also be an index for the song title.  A user could say "Give
> me every Beatles song named Help that is at least 160 kbps."  The search
> engine could then come up with a query plan.  The query plan would
> probably be to search the artist index for "Beatles," then look in the
> Song Title index for "Help," take the intersection, take the resulting
> key index numbers and look them up in the key pages, and then filter
> the results on the bitrate.
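
The query plan described above could be sketched roughly as follows. This is only an illustration on in-memory data: the `KeyEntry` fields, the `run_query` function, and the sample URIs are all hypothetical, and it assumes the bitrate is stored alongside each key entry. A real client would fetch each page over Freenet and would have to cope with pages that cannot be retrieved.

```python
from typing import Dict, List, NamedTuple


class KeyEntry(NamedTuple):
    uri: str
    genre: str
    artist: str
    bitrate_kbps: int  # assumed to be stored with the key entry


def run_query(artist_index: Dict[int, int],
              title_index: Dict[int, int],
              keys: Dict[int, KeyEntry],
              min_bitrate_kbps: int) -> List[KeyEntry]:
    """Intersect two index pages, look up the keys, then filter on bitrate."""
    # Steps 1 and 2: index numbers present in both the artist and title pages.
    matching = set(artist_index) & set(title_index)
    results = []
    for index_number in sorted(matching):
        # Step 3: look the index number up in the key pages.
        entry = keys.get(index_number)
        if entry is None:
            continue  # the key page was missing or could not be retrieved
        # Step 4: filter on the metadata stored with the entry.
        if entry.bitrate_kbps >= min_bitrate_kbps:
            results.append(entry)
    return results


# Hypothetical sample data: index number -> weight, and index number -> entry.
ARTIST_BEATLES = {2: 10, 15: 10, 30: 10}
TITLE_HELP = {2: 10, 15: 10, 77: 10}
KEYS = {
    2: KeyEntry("example/Help.mp3", "Rock", "The Beatles", 192),
    15: KeyEntry("example/Help-low.mp3", "Rock", "The Beatles", 128),
    30: KeyEntry("example/Yesterday.mp3", "Rock", "The Beatles", 192),
}

hits = run_query(ARTIST_BEATLES, TITLE_HELP, KEYS, min_bitrate_kbps=160)
```

Intersecting the index numbers before touching the key pages means only the surviving entries cost a key-page request, which matters on a high-latency network.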
> 
> 
> 
> I guess what I'm getting at is more than just a search capability.  It
> could actually work as a database on Freenet, albeit a high-latency
> one.  Searching would be the immediately obvious application of a
> database system, but there might be other later uses.
> 
> So the big question is, who wants to write an RDBMS over Freenet so
> Freenet can get some really good searching capability?
> 
> 
> 
> 
> 
> _______________________________________________
> devl mailing list
> [EMAIL PROTECTED]
> http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl

-- 
Matthew J Toseland - [EMAIL PROTECTED]
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.


