Re: [htdig-dev] Role/Purpose of db files

Lachlan Andrew Wed, 02 Jun 2004 06:55:35 -0700

Greetings Ted,

I'm not sure about any of this, but here is my current understanding:

All of the files you list in the first batch are produced by  htdig.

db.words.db records the actual words in the documents.  For each 
occurrence of each word (except bad_words and words shorter than a 
threshold), it stores the location in the document, the last "anchor" 
point in the document before the word and a set of flags, indicating 
the context the word is in (a heading? the page title? meta 
information? etc).  It certainly cannot be deleted.
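
To make the per-occurrence idea concrete, here is a rough Python sketch of what one such record might hold.  The class, field and flag names are my own invention for illustration (they are not the real ht://Dig on-disk layout or flag values), and  min_length  simply stands in for the length threshold mentioned above.

```python
from dataclasses import dataclass
from enum import IntFlag


class Context(IntFlag):
    """Hypothetical context flags; the real ht://Dig values may differ."""
    PLAIN = 0
    TITLE = 1
    HEADING = 2
    META = 4


@dataclass(frozen=True)
class WordOccurrence:
    """One record per occurrence of a word (illustrative fields only)."""
    word: str
    doc_id: int
    location: int    # position of the word within the document
    anchor: int      # last anchor point seen before the word
    flags: Context   # context the word appeared in


def index_words(doc_id, tokens, bad_words, min_length):
    """Yield a record for each occurrence that should be indexed.

    tokens is an iterable of (word, location, anchor, flags) tuples.
    Words in bad_words, or shorter than min_length, are skipped.
    """
    for word, location, anchor, flags in tokens:
        if word in bad_words or len(word) < min_length:
            continue
        yield WordOccurrence(word, doc_id, location, anchor, flags)
```

Because there is one record per occurrence rather than per distinct word, a file built this way grows with the total length of the indexed text.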

db.words.db_weakcmpr is a list of "holes" in the db.words.db file 
created by the compression process.  If it is deleted, you certainly 
can't do incremental digs.  I'm not sure if searching will work 
without it.  It is very small anyway.

db.excerpts stores the bit of text that is displayed for each matched 
document, showing either the start of the document or the context of 
the first match.  In principle you may be able to search without it 
(using the "short" output format), but  htsearch  fails to run if it 
is not present.  If you're worried about its size, you are better off 
setting  max_head_length  very small (ideally zero, but I'm not sure 
if that will cause a crash -- could you check?).

db.docdb stores data about each of the documents (the URL, the 
title, the total number of words, ...).  This is "global" 
information, rather than the "per-word" information in  db.words.db.  
It is definitely necessary.

All I know about db.docs.index comes from the  attrs.html  page...  
When a URL is encountered during digging, ht://Dig needs to be able 
to find out whether it has been visited before, and add the words in 
the hyperlink.  That information is stored in this file.  Although it 
sounds from  attrs.html  like it is safe to delete it,  htsearch  
will refuse to run if it is not present.  I think we need to update  
attrs.html,  but it is possible that  htsearch  could in fact run 
without it -- I don't know enough about the code.

You're right about the two  htfuzzy  files -- they are useful if you 
use the corresponding fuzzy rules, but can be safely deleted 
otherwise.
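
If it helps, the notes above can be turned into a quick check of a db directory.  This is only a sketch in Python: the grouping simply restates my understanding from this message (which, as I said, I'm not sure about), and the function names are made up for the example.

```python
import os

# File roles as described in this message.  They reflect the stated (and
# admittedly uncertain) understanding here, not a reading of the source.
REQUIRED = ("db.words.db", "db.excerpts", "db.docdb", "db.docs.index")
INCREMENTAL_ONLY = ("db.words.db_weakcmpr",)  # searching without it is untested
FUZZY_ONLY = ("db.accents.db", "db.metaphone.db")


def classify(present):
    """Group an iterable of file names by the roles listed above."""
    names = set(present)
    return {
        "missing_required": [f for f in REQUIRED if f not in names],
        "missing_for_incremental": [
            f for f in INCREMENTAL_ONLY if f not in names
        ],
        "fuzzy_present": [f for f in FUZZY_ONLY if f in names],
    }


def classify_dir(path):
    """Convenience wrapper: classify the files found in a db directory."""
    return classify(os.listdir(path))
```

Calling  classify_dir  on your db directory returns the three lists; anything under "fuzzy_present" should be safe to remove if you don't use the corresponding fuzzy rules.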

Hope this helps,
Lachlan

On Wed, 2 Jun 2004 07:03 pm, Ted Stresen-Reuter wrote:
> I was wondering if someone could tell me what all the db files are
> in my db directory. Specifically, I'd like to know the kind of data
> that is in them and what programs produce or use them. The files
> are:
>
> db.docdb
> db.docs.index
> db.excerpts
> db.words.db
> db.words.db_weakcmpr
>
> I'm pretty sure that the following are produced by htfuzzy, but
> please correct me if I'm wrong:
> db.accents.db
> db.metaphone.db
>
> Ted Stresen-Reuter
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the new InstallShield X.
> From Windows to Linux, servers to mobile, InstallShield X is the
> one installation-authoring solution that does it all. Learn more
> and evaluate today! http://www.installshield.com/Dev2Dev/0504
> _______________________________________________
> ht://Dig Developer mailing list:
> [EMAIL PROTECTED]
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev

