Hi Anton,
That actually did not work for me. Somewhere later in the thread there is an email
stating it did not work as planned... I thought it had worked, but in
actuality someone had added a lot of words to the bad words list, and the db
came out at just about 2 GB. I now have over 575,000 PDF docs to index, and I
don't use an index.html with links to each one; I use Apache auto-indexing, and my
htdig.conf looks something like this:
start_url: http://myserver.domain.org/docs/
local_urls: http://myserver.domain.org/docs/=/some/local/dir/
timeout: 420
wordlist_cache_size: 2000000
wordlist_compress: false
bad_extensions: .gz .tar .jpg .htm .tgz .rpm .gif .png .pl .sh
bad_querystr: ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D
# The line above is the one you want for Apache auto indexes, I think.
# It's been a while since I tinkered with the conf file.
max_head_length: 10000
max_stars: 5
robotstxt_name: localdig
max_doc_size: 5000000
matches_per_page: 100
maximum_word_length: 35
minimum_word_length: 3
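
In case it helps, a full dig with a config like this boils down to two
commands on 3.1.x. The config path below is just an example; adjust it
to wherever your htdig.conf lives:

    # -i forces an initial (from-scratch) dig, -vv raises verbosity,
    # -c points htdig at the config file above.
    htdig -i -vv -c /etc/htdig/htdig.conf
    # On 3.1.x, htmerge then builds the searchable word index.
    htmerge -v -c /etc/htdig/htdig.conf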
My bad words list is pretty long, so that helped keep the size of the
databases down. The 2 GB barrier is still there, no matter what
filesystem. I think this has been resolved in 3.2, but 3.2 is still
beta, so for now I just adjust the minimum word length and the bad words
list to stay under 2 GB (see the sketch below). Sorry for getting your
hopes up!
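
As a rough sketch of the kind of size watch that helps here (untested,
and the database path is only an example; adjust it to your install):

    #!/bin/sh
    # Warn when db.docdb nears the 2^31 - 1 byte ceiling that htdig
    # 3.1.x runs into.  The DB path is an example; adjust to your setup.
    DB=/opt/htdig/db/db.docdb
    LIMIT=2147483647
    [ -f "$DB" ] || exit 0
    SIZE=`ls -l "$DB" | awk '{print $5}'`
    # Complain once the file passes roughly 90% of the limit.
    if [ "$SIZE" -gt 1900000000 ]; then
        echo "WARNING: $DB is $SIZE bytes (limit $LIMIT)"
    fi

Run it from cron during a long dig and cron will mail you the warning
before htdig silently stops.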
Message: 6
Date: Wed, 10 Mar 2004 18:27:12 +0100
From: Anton Donner <[EMAIL PROTECTED]>
Organization: DLR
To: [EMAIL PROTECTED]
Subject: [htdig] db.docdb 2 GByte Limit?

Dear all,

After many, many unsuccessful attempts I really hope that the htdig
community can help me. My problem is as follows: I have a quite large
server with more than 100000 PDFs on it. For indexing I create an HTML
file with links to all the PDFs and use this file as start_url. But now
it seems that I have found a magical 2 GByte limit, because indexing (an
htdig run) stops as soon as db.docdb reaches a size of 2147483647
(2^31 - 1) bytes. I can see in the log files (htdig -vv) that htdig
simply stops and does not process the remaining PDFs.

Unsuccessful attempts so far have been:
- installation of htdig 3.1.5/3.1.6 (self-compiled, i.e. no package)
- db directory on an ext2/ext3/reiser partition
- kernel 2.4.10 (Suse 7.3)
- kernel 2.4.21 (Suse 9)

I've read in
http://www.geocrawler.com/mail/msg.php3?msg_id=9056546&list=8822
that Reiser-FS could be an option, but it didn't work for me. Besides, I
can easily create files bigger than 2 GByte even on an ext2 partition (I
really checked that with a shell script). Htdig 3.2.0b5 is not really an
option, since digging is slower than 3.1.6 by a factor of ten (which
would mean a full ten days of indexing), despite the possible
optimisations described in the FAQ.

I know that file sizes can be a matter of architecture (I run x86), but
also of the kernel (older kernels had this 2 GByte limit, but mine is
brand new?!?). What makes me wonder is that the author of the link above
could overcome his problems with a simple change of his file system, but
I can't...

Any help is really appreciated.

Thanks,
Anton
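
For reference, the kind of large-file shell test Anton mentions can be
as small as this sketch (the target path is just an example, and it
needs about 3 GByte of free space):

    # Write ~3 GByte of zeros.  If the kernel, the filesystem, or dd
    # itself lacks large-file support, this aborts with "File too
    # large" at the 2 GByte mark.
    dd if=/dev/zero of=/tmp/bigfile.test bs=1048576 count=3072
    ls -l /tmp/bigfile.test
    rm -f /tmp/bigfile.test

Note that passing this test only shows the kernel and filesystem side is
fine; a program built without large-file support, as htdig 3.1.x
apparently is, still stops at 2^31 - 1 bytes, which matches what both
mails describe.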

