Re: [freenet-dev] Search Indexing Round 2

Niklas Bergh Sun, 17 Aug 2003 23:22:10 -0700

----- Original Message -----
From: "Niklas Bergh" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 18, 2003 7:33 AM
Subject: Re: [freenet-dev] Search Indexing Round 2



>
> > On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > > > I think it would be better if the client application calculated the
> > > > scoring. I assume that the 'weight' you mention here is somehow
> > > > calculated from where in the page the word was found and how
> > > > 'valuable' in the given page the word is and so on..
> > >
> > > The "weight" could be calculated by any means the index-publisher
> > > wants, but it should generally indicate the relevance of a certain
> > > page to a certain keyword.
> > >
> > >
> > > > I would recommend that the index file contained this information
> > > > instead:
> > > > <Information Domain> <Relative position> <KEY>
> > >
> > > So basically you're saying more metadata should be stored on these
> > > index pages so that better queries can be done.
>
> Yes, but the index pages should only contain metadata that can be
> considered as belonging to the word (as in the example above)

OK, I withdraw the above. Information domains could be defined elsewhere and
referred to by the <information domain> field.
Now the obvious question becomes: should the words in every information
domain be fully indexed :)?
I am not so sure that this is a good idea.. I still think that all the
'words', from every domain, should be located in a common place
(like the 'index' files in previous mails).

> > > I can see two ways that we can handle orthogonal metadata: include
> > > it with the data in a particular index, or include it in its own
> > > separate index that uses the mechanism above.  For example, if you
> > > want to have a song search engine, you could have an index for the
> > > name of the song, and another index for the artists.  Orthogonal
> > > metadata like genre and bitrate could be stored along with the
> > > entries instead of in their own indexes.  If they were stored in
> > > their own indexes, then the page for "128 kbps" would be
> > > unacceptably HUGE.
> > >
> > > This is starting to look like a database.  Databases need less storage
> > > space if they are normalized.  With multiple indexes, pages could be
> > > stored like this:
> > >
> > >
> > > [EMAIL PROTECTED]/mySearch/keys/keys1
> > Sorry... my message got loose before it was done (Evolution is acting
> > up).
> >
> >
> > as I was saying, you could have a listing of every page that your search
> > system indexes, split across several files.  For example
> >
> > [EMAIL PROTECTED]/mySearch/keys/keys1
> > would contain
> > 1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
> > 2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
> > ...
>
> (Primitive) partitioning indeed; your ideas are starting to catch up
> with newer RDBMS functionality.
>
> > The first number is just an index number.  It is used so that any other
> > indexes only need to store that index number and its weight (which means
> > saved space when a certain file is indexed in multiple places.) Other
> > data that is on a 1-to-1 correspondence with the key can also be put
> > here.
> >
> > The "keys1" page would contain entries 1 through 100, "keys2" would
> > contain 101 through 200, etc.
> >
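
A minimal sketch of the page arithmetic described above, assuming 1-based
index numbers and exactly 100 entries per page as in the example. The names
key_page_number and parse_key_line are hypothetical, and "<key>" stands in
for a real Freenet key:

```python
import shlex

ENTRIES_PER_PAGE = 100  # taken from the example: keys1 = 1..100, keys2 = 101..200


def key_page_number(index_number, per_page=ENTRIES_PER_PAGE):
    """Return the 1-based number N of the keysN page holding this entry."""
    if index_number < 1:
        raise ValueError("index numbers start at 1")
    return (index_number - 1) // per_page + 1


def parse_key_line(line):
    """Split one key-page line into (index number, key, extra fields)."""
    fields = shlex.split(line)  # honours the double-quoted fields
    return int(fields[0]), fields[1], fields[2:]
```

With this layout, entry 100 is still on keys1 and entry 101 is the first
entry on keys2.
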
> > An index like this:
> > [EMAIL PROTECTED]/mySearch/indexes/artist
> > would contain pages that contain information about specific artists.
> >
> > For example:
> > [EMAIL PROTECTED]/mySearch/indexes/artist/Beatles
> > could list:
> > 2 10
> > 15 10
> > 30 10
> > 34234 10
> > 545 10
> > where the first number refers to the index number of a song in the
> > "keys" directory, and the second number is the weight.  When querying
> > the Beatles page, the search engine would then request key pages 1, 6,
> > and 343 (which contain the keys for the songs listed in this
> > particular index).
> >
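
To make the lookup step concrete, here is a rough sketch that parses an
index page of "<index number> <weight>" lines and works out which key pages
must be fetched. It assumes the layout where keys1 holds entries 1 through
100; parse_index_page and pages_needed are hypothetical names:

```python
from collections import defaultdict


def parse_index_page(text):
    """Parse lines of '<index number> <weight>' into (number, weight) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        number, weight = line.split()
        entries.append((int(number), int(weight)))
    return entries


def pages_needed(entries, per_page=100):
    """Group index numbers by the 1-based key page that contains them."""
    pages = defaultdict(list)
    for number, _weight in entries:
        pages[(number - 1) // per_page + 1].append(number)
    return dict(pages)


beatles_page = "2 10\n15 10\n30 10\n34234 10\n545 10\n"
needed = pages_needed(parse_index_page(beatles_page))
```

Here sorted(needed) gives the key pages to request, so entries that share a
page cost only one fetch.
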
> >
> > There could also be an index for the song title.  A user could say "Give
> > me every Beatles song named Help that is at least 160 kbps."  The search
> > engine could then come up with a query plan.  The query plan would
> > probably be to search the artist index for "Beatles," then look in the
> > Song Title index for "Help," take the intersection, take the resulting
> > key index numbers and look them up in the key pages, and then filter
> > the results on the bitrate.
> >
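
The query plan above could look roughly like this. It is only a sketch:
in-memory dicts stand in for pages fetched from Freenet, the bitrate field
on the key rows is an assumption (the example rows only show genre and
artist), and ranking by the sum of the two weights is my own choice, not
something settled in this thread:

```python
def run_query(artist_page, title_page, key_rows, min_bitrate):
    """Intersect two index pages, look up the keys, filter on bitrate.

    artist_page, title_page: dict of index number -> weight
    key_rows: dict of index number -> {"key": str, "bitrate": int}
    """
    hits = artist_page.keys() & title_page.keys()  # the intersection step
    results = []
    for number in sorted(hits):
        row = key_rows[number]  # stands in for fetching the right keysN page
        if row["bitrate"] >= min_bitrate:
            score = artist_page[number] + title_page[number]  # assumed ranking
            results.append((score, row["key"]))
    results.sort(key=lambda item: item[0], reverse=True)
    return [key for _score, key in results]


artist_beatles = {2: 10, 15: 10, 30: 10}
title_help = {2: 10, 77: 5}
key_rows = {
    2: {"key": "<key>/Help.mp3", "bitrate": 192},
    15: {"key": "<key>/Yesterday.mp3", "bitrate": 128},
}
found = run_query(artist_beatles, title_help, key_rows, min_bitrate=160)
```

Only index number 2 is in both index pages, so only its key row has to be
looked up before the bitrate filter is applied.
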
> >
> >
> > I guess what I'm getting at is more than just a search capability.  It
> > could actually work as a database on freenet, albeit a high-latency
> > one.  Searching would be the immediately obvious application of a
> > database system, but there might be other later uses.
>
> Actual content indexing *isn't* an obvious application of an RDBMS;
> however, the storage structure of a generated index (index as in
> content index, not as in RDBMS index) fits very well into an RDBMS.
>
> > So the big question is, who wants to write an RDBMS over Freenet so
> > Freenet can get some really good searching capability?
>
> Let me just tie this in with the searching discussion.
> What we need of an RDBMS to implement the search system that we want is:
> 1. A word index (originally located at [EMAIL PROTECTED]/mySearch/<word>).
>    Used for content indexing, i.e. locating the resources that contain
>    a certain word.
> 2. Resource metadata (possibly partitioned into multiple
>    [EMAIL PROTECTED]/mySearch/keys/keysX). Contains metadata for the
>    different resources referred to anywhere in the 'kit' (examples could
>    be file name, size, bitrate and so on).
> 3. A number of alternative indexes (for instance
>    [EMAIL PROTECTED]/mySearch/indexes/artist), to enable access to items
>    by different metadata.
> 4. A metadata file describing which of the indexes and keysX files (and
>    possibly words) are present in the 'db'.
>
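
As a rough illustration of item 4, the descriptor could be as small as the
sketch below. Everything here is an assumption made for the example: the
Manifest name, its fields, and the JSON encoding are not a proposed format,
and the paths are shown relative to the (omitted) key:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Manifest:
    """Hypothetical descriptor of what is present in the 'db'."""
    key_pages: int                      # how many keysX files exist
    entries_per_page: int = 100         # as in the keys1/keys2 example
    indexes: list = field(default_factory=list)

    def key_page_paths(self, root="mySearch"):
        return [f"{root}/keys/keys{n}" for n in range(1, self.key_pages + 1)]

    def index_path(self, index, term, root="mySearch"):
        if index not in self.indexes:
            raise KeyError(f"no such index: {index}")
        return f"{root}/indexes/{index}/{term}"


manifest = Manifest(key_pages=3, indexes=["artist", "title"])
encoded = json.dumps(asdict(manifest), sort_keys=True)
decoded = Manifest(**json.loads(encoded))
```

A client would fetch this one file first and then know which index and key
pages it is allowed to ask for.
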
> I have a feeling every file except the 'Resource' files would compress
> very well.. Intelligent use of containers would probably be good..
>
> /N
>
> _______________________________________________
> devl mailing list
> [EMAIL PROTECTED]
> http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
>

