Re: htdig: Preliminary proposal for data structures to suppo

plucas Mon, 8 Jun 1998 15:26:59 -0400

     Whilst I would certainly like to be able to do phrase searching the 
     most important element of any search engine to me has got to be speed.
     
     In posts over the last few months I have noticed Andrew and several 
     others suggesting a move away from GDBM (hashed type?) databases to 
     btree or even SQL to improve performance.
     
     I see that your examples use a GDBM database and their "index will be 
     at least 3 times the size of the collected documents".
     
     Would a different type of database be able to achieve the same thing 
     more efficiently? With 2.5GB of data, a 7.5GB index would be a high 
     price to pay for a useful function.
     
     Paul Lucas
     Frost & Sullivan


______________________________ Reply Separator _________________________________
Subject: htdig: Preliminary proposal for data structures to support p
Author:  jmoore <[EMAIL PROTECTED]> at Internet
Date:    6/9/98 12:11 AM


     
This post is for anyone who is interesting in discussing new features for 
the htdig engine.
     
One of the main features that i would like to see in the new version of 
htdig is support for searching on a phrase.  As you may know, currently 
only individual words are kept in the index, which makes it easy to search 
for things like "purina and dogs and chow" but not "purina dog chow". 
What I propose is to store the words before and after each word in the 
index. 
     
e.g.
     
db.words.gdbm (GDBM) would contain a structure like:
     
  Key    |  Value
  -------+----------------------------------------- 
  <Word> |  <Previous Word><Next Word>
         |  <Count><ID><Weight><Anchor><Location>
         |  [repeats for each instance of <Word> in a document...]
     
So, the phrase "object oriented database" would be retrieved by looking up 
the references to "oriented" and rejecting those that don't have the 
previous word "object" or the next word "database". 
     
(Note: this could be generalized to handle a phrase with any number of 
words.) 
     
The main problem with this approach as outlined, is that the index will be 
at least 3 times the size of the collected documents since the previous and 
next word is stored for each word.  There are probably a lot of 
optimizations that can happen here - the first is to use 2 byte short ints 
for word id's and keep the mappings between the words and their IDs in a 
separate gdbm file.  (This will also fit in with the current setup where 5 
integers are packed and assigned to a "word" in "db.words.gdbm"). 
     
Another optimization would be for all document collections to share the 
same Word->ID index.  (There would be some issues around to ensure that 
this index could be recreated, since if it was corrupted all indexes 
would have to be rebuilt - big hassle.)
     
Another optimization would be to drop certain words from the index. 
Presumably, if stop words are dropped from the users search term and the 
indexes, the slight strangeness of the results would still be intuitive 
to the end user. e.g. A search for "Sid and Nancy" would be converted to 
"sid nancy" and would return matches in the index to "Sid Nancy" or "Sid 
and Nancy" or "Sid of Nancy" (assuming "and" and "of" were stop words). 
I haven't fully worked out the details of this one yet.
     
I've been prototyping the data model with perl so, hopefully I can have 
some more specific factual data as to the size of the indexes.  I think 
disk space - and balancing between the accuracy of the index and its size 
will probably be the key issue.  I'm including more details of the
data structures below.  Please post any comments or concerns to the list.
     
:jason 
     
[EMAIL PROTECTED]
     
The following 3 data structures are GDBM files.
     
  Name: DocumentDB
  Key            | Value
  ---------------+--------------------------------- 
  <Document ID>  | <What>       [char] 
                 | <Time>       [integer]
                 | <Accessed>   ["]
                 | <State>      ["]
                 | <Size>       ["]
                 | <Links>      ["]
                 | <ImageSize>  ["]
                 | <HopCount>   ["]
                 | <URL>        [variable size]
                 | <Head>       ["]
                 | <Title>      ["]
                 | <Description>["]
                 | <Email>      ["]
                 | <Notification> ["]
                 | <Subject>    ["]
     
  Name: WordDB
  Key       | Value
  ----------+-------------------------------------- 
  <Word ID> | <Previous Word ID> [unsigned short int]
            | <Next Word ID>     ["]
            | <Document ID>      ["]
            | <Count>            ["]
            | <Weight>           ["]
            | <Anchor>           ["]
            | <Location>         ["]
            | [ entire structure repeats for each occurance 
            |   of the word in the documents... 14 bytes/occurance]
     
  Name: Dictionary
  Key       | Value
  ----------+-------------------------------------- 
  <Word>    | <Word ID> [unsigned short int] [unique]
     
     
     
     
     
     
---------------------------------------------------------------------- 
To unsubscribe from the htdig mailing list, send a message to 
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.

Received: from eudoramail.frost.com ([207.221.223.6]) by smtp.frost.com with
SMTP
  (IMA Internet Exchange 3.01 Enterprise) id 0000FA4B; Mon, 8 Jun 98 11:29:42
-0700
Received: from sdsu.edu (130.191.229.14) by eudoramail.frost.com (Worldmail
1.3.167) for [EMAIL PROTECTED]; 7 Jun 1998 23:07:13 -0500
Received: (from daemon@localhost)
        by sdsu.edu (8.8.7/8.8.7) id WAA16003
        for htdig-out; Sun, 7 Jun 1998 22:17:33 -0700 (PDT)
Received: from sartre ([EMAIL PROTECTED] [209.5.16.7])
        by sdsu.edu (8.8.7/8.8.7) with ESMTP id WAA15998
        for <[EMAIL PROTECTED]>; Sun, 7 Jun 1998 22:17:27 -0700 (PDT)
Received: from localhost (jmoore@localhost) by sartre (8.7.6/8.7.3) with SMTP id
AAA26797 for <[EMAIL PROTECTED]>; Tue, 9 Jun 1998 00:11:43 -0400
X-Authentication-Warning: sartre: jmoore owned process doing -bs
Date: Tue, 9 Jun 1998 00:11:42 -0400 (EDT)
From: jmoore <[EMAIL PROTECTED]>
X-Sender: jmoore@sartre
To: [EMAIL PROTECTED]
Subject: htdig: Preliminary proposal for data structures to support phrases
Message-ID: <Pine.LNX.3.95.980609000327.26504A-100000@sartre>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: [EMAIL PROTECTED]
Precedence: bulk
Reply-To: jmoore <[EMAIL PROTECTED]>

Re: htdig: Preliminary proposal for data structures to suppo

Reply via email to