Greetings Ted,

I'm not sure about any of this, but here is my current understanding:

All of the files you list in the first batch are produced by htdig.

db.words.db records the actual words in the documents. For each occurrence of each word (except bad_words and words shorter than a threshold), it stores the location in the document, the last "anchor" point in the document before the word, and a set of flags indicating the context the word appears in (a heading? the page title? meta information? etc). It certainly cannot be deleted.

db.words.db_weakcmpr is a list of "holes" in the db.words.db file created by the compression process. If it is deleted, you certainly can't do incremental digs. I'm not sure whether searching will work without it. It is very small anyway.

db.excerpts stores the bit of text that is displayed for each matched document, showing either the start of the document or the context of the first match. In principle you may be able to search without it (using the "short" output format), but htsearch fails to run if it is not present. If you're worried about its size, you are better off setting max_head_length very small (ideally zero, but I'm not sure whether that will cause a crash; could you check?).

db.docdb stores data about each of the documents (the URL, the title, the total number of words, ...). This is "global" information, rather than the "per-word" information in db.words.db. It is definitely necessary.

All I know about db.docs.index comes from the attrs.html page. When a URL is encountered during digging, ht://Dig needs to be able to find out whether it has been visited before, and add the words in the hyperlink. That information is stored in this file. Although it sounds from attrs.html like it is safe to delete, htsearch will refuse to run if it is not present. I think we need to update attrs.html, but it is possible that htsearch could in fact run without it; I don't know enough about the code.
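
If it helps, the notes above can be turned into a quick check of a db directory. This is only a sketch in Python, not anything shipped with ht://Dig: the function name missing_db_files and the role descriptions are my own, and the roles simply restate my understanding above rather than anything verified against the source.

```python
from pathlib import Path

# Roles as described in this message; NOT verified against the ht://Dig source.
DB_FILES = {
    "db.words.db": "required: per-word index",
    "db.words.db_weakcmpr": "needed for incremental digs; unclear for searching",
    "db.excerpts": "htsearch fails to run without it",
    "db.docdb": "required: per-document data",
    "db.docs.index": "htsearch refuses to run without it",
}


def missing_db_files(db_dir):
    """Return the names from DB_FILES that are absent from db_dir, sorted."""
    root = Path(db_dir)
    return sorted(name for name in DB_FILES if not (root / name).is_file())
```

Calling missing_db_files on your db directory returns the names that are absent, and DB_FILES[name] gives the reason each one matters.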

You're right about the two htfuzzy files -- they are useful if you use the corresponding fuzzy rules, but can be safely deleted otherwise.

Hope this helps,
Lachlan

On Wed, 2 Jun 2004 07:03 pm, Ted Stresen-Reuter wrote:
> I was wondering if someone could tell me what all the db files are
> in my db directory. Specifically, I'd like to know the kind of data
> that is in them and what programs produce or use them. The files
> are:
>
> db.docdb
> db.docs.index
> db.excerpts
> db.words.db
> db.words.db_weakcmpr
>
> I'm pretty sure that the following are produced by htfuzzy, but
> please correct me if I'm wrong:
> db.accents.db
> db.metaphone.db
>
> Ted Stresen-Reuter
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the new InstallShield X.
> From Windows to Linux, servers to mobile, InstallShield X is the
> one installation-authoring solution that does it all. Learn more
> and evaluate today! http://www.installshield.com/Dev2Dev/0504
> _______________________________________________
> ht://Dig Developer mailing list:
> [EMAIL PROTECTED]
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev
