Re: [freenet-dev] Search Indexing Round 2

Niklas Bergh Sun, 17 Aug 2003 22:35:45 -0700

> On Sun, 2003-08-17 at 18:24, Scott Young wrote:
> > > I think it would be better if the client application calculated the
> > > scoring. I assume that the 'weight' you mention here is somehow
> > > calculated from where in the page the word was found and how 'valuable'
> > > in the given page the word is and so on.
> >
> > The "weight" could be calculated by any means the index-publisher wants,
> > but it should generally indicate the relevance of a certain page to a
> > certain keyword.
> >
> >
> > > I would recommend that the index file contained this information instead:
> > > <Information Domain> <Relative position> <KEY>
> >
> > So basically you're saying more metadata should be stored on these index
> > pages so that better queries can be done.


Yes, but the index pages should only contain metadata that can be considered
as belonging to the word (as in the example above).
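
To make the client-side scoring idea concrete, here is a rough sketch. It
assumes each index entry carries the three fields above and that the client
picks its own weights. The domain names, the weight table and the position
penalty are all made up for illustration; nothing here is an agreed format.

```python
# Hypothetical per-domain weights chosen by the client, not by the publisher.
DOMAIN_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def score(domain, relative_position):
    """Score one index entry.

    relative_position is assumed to lie in [0, 1], where 0 means the word
    appears at the very start of its information domain.
    """
    base = DOMAIN_WEIGHTS.get(domain, 1.0)
    # Words found earlier count for more; the 0.5 factor is arbitrary.
    return base * (1.0 - 0.5 * relative_position)


def rank(entries):
    """Return the keys of (domain, relative_position, key) entries, best first."""
    ordered = sorted(entries, key=lambda e: score(e[0], e[1]), reverse=True)
    return [key for _domain, _position, key in ordered]
```

Since the scoring lives in the client, two clients could rank the same index
page differently without the publisher changing anything.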

> > I can see two ways that we
> > can handle orthogonal metadata: include it with the data in a particular
> > index, or include it in its own separate index that uses the mechanism
> > above.  For example, if you want to have a song search engine, you could
> > have an index for the name of the song, and another index for the
> > artists.  Orthogonal metadata like Genre and bitrate could be stored
> > along with the entries instead of in their own indexes.  If they were
> > stored in their own indexes, then the page for "128 kbps" would be
> > unacceptably HUGE.
> >
> > This is starting to look like a database.  Databases need less storage
> > space if they are normalized.  With multiple indexes, pages could be
> > stored like this:
> >
> >
> > [EMAIL PROTECTED]/mySearch/keys/keys1
> Sorry... my message got loose before it was done (Evolution is acting
> up).
>
>
> as I was saying, you could have a listing of every page that your search
> system indexes, split across several files.  For example
>
> [EMAIL PROTECTED]/mySearch/keys/keys1
> would contain
> 1 "[EMAIL PROTECTED]/ItDontMeanAThing.mp3" "Jazz" "Ella Fitzgerald"
> 2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"
> ...

(Primitive) partitioning indeed; your ideas are starting to catch up with
newer RDBMS functionality.
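
For what it's worth, a client could read one line of such a keys page roughly
like this. It is only a sketch and assumes the layout shown in the example
(index number, then double-quoted fields separated by spaces); the function
name is made up.

```python
import shlex


def parse_keys_line(line):
    """Split one keys-page line into (index_number, key, extra_fields)."""
    fields = shlex.split(line)  # honours the double-quoted fields
    index_number = int(fields[0])
    key = fields[1]
    # Everything after the key is 1-to-1 metadata, e.g. genre and artist.
    return index_number, key, fields[2:]


line = '2 "[EMAIL PROTECTED]/Help.mp3" "Rock" "The Beatles"'
print(parse_keys_line(line))
# (2, '[EMAIL PROTECTED]/Help.mp3', ['Rock', 'The Beatles'])
```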

> The first number is just an index number.  It is used so that any other
> indexes only need to store that index number and its weight (which means
> saved space when a certain file is indexed in multiple places.) Other
> data that is on a 1-to-1 correspondence with the key can also be put
> here.
>
> The "keys1" page would contain entries 1 through 100, "keys2" would
> contain 101 through 200, etc.
>
> An index like this:
> [EMAIL PROTECTED]/mySearch/indexes/artist
> would contain pages that contain information about specific artists.
>
> For example:
> [EMAIL PROTECTED]/mySearch/indexes/artist/Beatles
> could list:
> 2 10
> 15 10
> 30 10
> 34234 10
> 545 10
> where the first number refers to the index number of a song in the
> "keys" directory, and the second number is the weight.  When querying
> the Beatles page, the search engine would then request pages 1, 6, and
> 343 (which contain the keys for the songs listed in this particular
> index).
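
A small sketch of that lookup, assuming exactly the numbering described
("keys1" holds entries 1 through 100, "keys2" holds 101 through 200, and so
on). The function names are made up. Under that numbering the five entries
listed for the Beatles page land on keys pages 1, 6 and 343.

```python
ENTRIES_PER_PAGE = 100  # from the example: keys1 holds entries 1..100


def keys_page_for(index_number):
    """Return N such that page "keysN" holds the given 1-based index number."""
    return (index_number - 1) // ENTRIES_PER_PAGE + 1


def pages_to_request(index_entries):
    """Collect the distinct keys pages needed for (index_number, weight) pairs."""
    return sorted({keys_page_for(number) for number, _weight in index_entries})


beatles = [(2, 10), (15, 10), (30, 10), (34234, 10), (545, 10)]
print(pages_to_request(beatles))  # [1, 6, 343]
```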
>
>
> There could also be an index for the song title.  A user could say "Give
> me every Beatles song named Help that is at least 160 kbps."  The search
> engine could then come up with a query plan.  The query plan would
> probably be to search the artist index for "Beatles," then look in the
> Song Title index for "Help," take the intersection, take the resulting
> key index numbers and look them up in the key pages, and then filter
> the results on the bitrate.
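
That query plan could be sketched as below, with plain dicts standing in for
pages that have already been fetched. The dict layout, the "kbps" field and
the placeholder key strings are assumptions for illustration only.

```python
def run_query(artist_index, title_index, key_records, artist, title, min_kbps):
    """Intersect two indexes, resolve the hits to keys, then filter on bitrate."""
    # Each index maps a term to a list of (index_number, weight) pairs.
    artist_hits = {n for n, _w in artist_index.get(artist, [])}
    title_hits = {n for n, _w in title_index.get(title, [])}
    candidates = artist_hits & title_hits

    results = []
    for number in sorted(candidates):
        record = key_records.get(number)  # entry from the matching keys page
        if record is not None and record["kbps"] >= min_kbps:
            results.append(record["key"])
    return results


artist_index = {"Beatles": [(2, 10), (15, 10), (30, 10)]}
title_index = {"Help": [(2, 10), (77, 10)]}
key_records = {
    2: {"key": "<key of Help.mp3>", "kbps": 160},
    15: {"key": "<key of another song>", "kbps": 128},
}
print(run_query(artist_index, title_index, key_records, "Beatles", "Help", 160))
# ['<key of Help.mp3>']
```

Doing the intersection before touching the keys pages means only the pages
for the surviving index numbers have to be requested.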
>
>
>
> I guess what I'm getting at is more than just a search capability.  It
> could actually work as a database on freenet, albeit a high-latency
> one.  Searching would be the immediately obvious application of a
> database system, but there might be other later uses.

Actual content indexing *isn't* an obvious application of an RDBMS. However,
the storage structure of a generated index (index as in content index, not
as in RDBMS index) fits very well into an RDBMS.

> So the big question is, who wants to write an RDBMS over Freenet so
> Freenet can get some really good searching capability?

Let me just tie this in with the searching discussion.
What we need from an RDBMS to implement the search system we want is:
1. A word index (originally located at [EMAIL PROTECTED]/mySearch/<word>),
used for content indexing, i.e. locating the resources that contain a
certain word.
2. Resource metadata (possibly partitioned into multiple
[EMAIL PROTECTED]/mySearch/keys/keysX files). This contains metadata
for the resources referred to anywhere in the 'kit' (for example
file name, size, bitrate and so on).
3. A number of alternative indexes (for instance
[EMAIL PROTECTED]/mySearch/indexes/artist), to enable access to items
by different metadata.
4. A metadata file describing which of the indexes and keysX files (and
possibly words) are present in the 'db'.
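
As a rough illustration of item 4, the metadata file could be modelled on the
client side along these lines. The class and field names are hypothetical,
and the base location is a placeholder; only the path shapes (<word>,
keys/keysX, indexes/<name>/<term>) come from the examples earlier in the
thread.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical description of what one search 'db' contains."""
    base: str                    # placeholder for the publisher's mySearch location
    entries_per_keys_page: int   # e.g. 100, as in the keys1/keys2 example
    keys_pages: int              # how many keysX files exist
    indexes: list = field(default_factory=list)  # e.g. ["artist", "title"]

    def word_page_path(self, word):
        return f"{self.base}/{word}"

    def keys_page_path(self, index_number):
        page = (index_number - 1) // self.entries_per_keys_page + 1
        if not 1 <= page <= self.keys_pages:
            raise KeyError(f"no keys page for index number {index_number}")
        return f"{self.base}/keys/keys{page}"

    def index_page_path(self, index_name, term):
        if index_name not in self.indexes:
            raise KeyError(f"index {index_name!r} is not present in this db")
        return f"{self.base}/indexes/{index_name}/{term}"


m = Manifest(base="BASE/mySearch", entries_per_keys_page=100,
             keys_pages=343, indexes=["artist"])
print(m.keys_page_path(545))                 # BASE/mySearch/keys/keys6
print(m.index_page_path("artist", "Beatles"))  # BASE/mySearch/indexes/artist/Beatles
```

With something like this a client can tell up front which requests are
pointless, instead of waiting for a high-latency fetch to fail.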

I have a feeling every file except the 'Resource' files would compress very
well. Intelligent use of containers would probably be good.

/N






_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
