Re: keeping the table ordered

Nurullah Akkaya Tue, 06 Feb 2007 10:01:26 -0800


On Feb 6, 2007, at 11:25 AM, Michael Segel wrote:

Sorry to top post, on my crackberry...

I think you missed my point.
Select the count of your documents that use the word 'the'.
Ok so let's say that you want to search for all of the documents that use the word 'the'. You first lookup the integer representation of the word. Let's say that it's = 100.
How many times is the value 100 going to be in your index?

that varies with the document set. with 2 million documents i have around 2.5 million 'the' entries.

Ok?
But to your other point... You see that your data is not contiguous. Hmmm ok, so assuming that your primary index is wordID, how do you handle documents that have multiple words? So if you search on 'the' you'll get one set of data and if you then search on the wordID for 'is', you'll have data that isn't in sort order on the disk.

assume the following ids.
the -> 100
is -> 150
linux -> 101

i want my tables to be sorted like the following: not just the word 'the', but all ids are sorted.

from my knowledge of databases, rows are stored in no particular order, so we have indexes pointing to where the data is. in the upper example i am going to read one big chunk of data from the disk, but in the bottom example i will read 100, then jump a bunch of records and read the next.
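
The difference between the two layouts can be sketched in a few lines. This is only an illustration, not a measurement: the postings below are made up (only the ids the -> 100, linux -> 101, is -> 150 come from the message), and each contiguous stretch of matching rows stands in for one seek followed by a sequential read.

```python
# Hypothetical postings as (word_id, doc_id) pairs, in the order they
# might arrive while documents are indexed one after another.
arrival_order = [
    (100, 1), (150, 1), (101, 2), (100, 2),
    (150, 3), (100, 3), (101, 4), (100, 4),
]


def contiguous_runs(rows, word_id):
    """Count the separate stretches of rows that hold word_id.

    Each stretch is a stand-in for one seek plus one sequential read.
    """
    runs = 0
    previous_matched = False
    for row_word_id, _doc_id in rows:
        matched = row_word_id == word_id
        if matched and not previous_matched:
            runs += 1
        previous_matched = matched
    return runs


# The same postings, kept sorted by word_id (then doc_id).
sorted_by_word = sorted(arrival_order)

print(contiguous_runs(arrival_order, 100))   # 4 separate stretches
print(contiguous_runs(sorted_by_word, 100))  # 1 stretch
```

With the rows in arrival order, the entries for word 100 are scattered across four places; once the rows are sorted by word id they form a single stretch.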

where can i learn more about compound indexes? the CREATE INDEX statement in the ref manual doesn't seem to mention them.

Now here's something that may help,
Drop all of your indexes and create a single compound index where the first field is wordID.
That may help you out...
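
A compound index is simply an index declared over more than one column, with wordID listed first. The sketch below is a rough illustration only: it uses Python's built-in sqlite3 module as a stand-in for Derby, and the table, column and index names (postings, word_id, doc_id, postings_word_doc) are hypothetical. The multi-column CREATE INDEX form is standard SQL, but whether Derby answers the query from the index alone should be confirmed in its own query plan.

```python
import sqlite3

# SQLite (in memory) stands in for Derby; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (word_id INTEGER NOT NULL, doc_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO postings (word_id, doc_id) VALUES (?, ?)",
    [(100, 1), (150, 1), (101, 2), (100, 2), (150, 3), (100, 3)],
)

# Compound index: word_id is the leading column, doc_id follows it, so the
# index entries for one word sit next to each other, ordered by document.
conn.execute("CREATE INDEX postings_word_doc ON postings (word_id, doc_id)")

query = "SELECT doc_id FROM postings WHERE word_id = ? ORDER BY doc_id"
doc_ids = [row[0] for row in conn.execute(query, (100,))]
print(doc_ids)

# Ask SQLite how it plans to run the query; the last column of each plan
# row is a human-readable description that names the index it uses.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (100,)).fetchall()
for row in plan:
    print(row[-1])
```

Because both columns the query needs are in the index, the engine can in principle walk one contiguous range of index entries instead of jumping to scattered table rows.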


Sent via BlackBerry.

-Mike Segel
Principal
MSCC
312 952 8175


-----Original Message-----
From: Nurullah Akkaya <[EMAIL PROTECTED]>
Date: Tue, 6 Feb 2007 11:14:02
To:Derby Discussion <[email protected]>
Subject: Re: keeping the table ordered
It is not quite clear to me what you are trying to achieve. Why do you want a sequential read? Scanning the entire table of 100 million records should take longer than looking up a record using an index on wordid. Have you retrieved the query plan and made sure the index on wordid is used? Or are you talking about doing a lookup of many different wordids in sorted order?

i did not mean sequential scanning of the whole table, i meant disk i/o (the bottom paragraph explains it). yes, i checked the query plan: derby uses the index to look up records, and the index lookup checks only two index pages. so i came to the conclusion that most of the time is lost making random i/o requests for the data. that's why i am trying to keep the table sorted, since sequential hard disk access is much faster than random i/o.

On Feb 6, 2007, at 8:09 AM, Michael Segel wrote:

What exactly are you trying to do?
Based on the little snippet, it looks like this is an exercise to create a
"google like" search on a series of documents.
The problem is that your wordID, while an integer, is not going to be unique
enough.
wordId isn't unique at all. each word in a document gets a corresponding posting entry. i look up the wordId for the word 'the', then select all docIds containing that wordId. the posting list is basically a big inverted list. what i am trying to do is keep the table sorted by wordId, so that instead of values being kept randomly on disk they are written sequentially to the file, and instead of doing random i/o i just do a sequential read from the hard drive. i don't want sequential scanning of the whole table.
For example, search your documents where the wordID is the integer look up for
the word "the".


Do you see the problem?


--
Michael Segel
Principal
Michael Segel Consulting Corp.
derby [EMAIL PROTECTED]: <mailto:[EMAIL PROTECTED]>
(312) 952-8175 [mobile]
