Hi Anton,
 
That actually did not work for me.  Somewhere later in this thread there is an email stating it did not work as planned...  I thought it had worked, but in actuality someone had added a lot of words to the bad words list, and the db came out to just under 2 GB.  I now have over 575,000 PDF docs to index.  I don't use an index.html with links to each one; I use Apache auto-indexing, and my htdig.conf looks something like this:
 
start_url:      http://myserver.domain.org/docs/
local_urls:     http://myserver.domain.org/docs/=/some/local/dir/
timeout:                420
wordlist_cache_size:    2000000
wordlist_compress:      false
bad_extensions:         .gz .tar .jpg .htm .tgz .rpm .gif .png .pl .sh
bad_querystr:           ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D
# The line above is the one you want for Apache auto-indexes, I think.
# It's been a while since I tinkered with the conf file.

max_head_length:        10000
max_stars:              5
robotstxt_name:         localdig
 
max_doc_size:           5000000
matches_per_page:       100
maximum_word_length:    35
minimum_word_length:    3
 
My bad words list is pretty long, so that helped keep the size of the databases down.  The 2 GB barrier is still there, no matter what filesystem I use.  I think this has been resolved in 3.2, but that's still beta.  I just adjust the minimum word length and the bad words list to stay under 2 GB.  Sorry for getting your hopes up!
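
In case it's useful, the knobs I lean on are just the stock htdig attributes; something like this in htdig.conf (I believe ${common_dir}/bad_words is the stock location for the bad words file, so adjust the path for your install):

bad_word_list:          ${common_dir}/bad_words
minimum_word_length:    3
# bad_words is a plain text file with one word per line; every word listed
# there is skipped at indexing time, which is what keeps the databases smaller.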

Message: 6
Date: Wed, 10 Mar 2004 18:27:12 +0100
From: Anton Donner <[EMAIL PROTECTED]>
Organization: DLR
To: [EMAIL PROTECTED]
Subject: [htdig] db.docdb 2 GByte Limit?

Dear all,

after many, many unsuccessful attempts I really hope that the htdig
community can help me. My problem is as follows:
I have a fairly large server with more than 100000 PDFs on it. For
indexing I create an HTML file with links to all the PDFs and use this
file as start_url. But now it seems that I have hit a magic 2 GByte limit,
because indexing (an htdig run) stops as soon as db.docdb reaches a size
of 2147483647 (2^31 - 1) bytes. I can see in the log files (htdig -vv)
that htdig simply stops and does not process the remaining PDFs.
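
The link file itself is generated with a small script, roughly along these
lines (the server name and local path here are only placeholders for my
real setup):

#!/bin/sh
# Write one HTML page containing a link to every PDF below the docroot;
# htdig then gets this page as start_url.
cd /srv/www/pdfs || exit 1
{
  echo "<html><body>"
  find . -name '*.pdf' | sed 's|^\./||' | \
    awk '{ printf "<a href=\"http://myserver.example.org/pdfs/%s\">%s</a><br>\n", $0, $0 }'
  echo "</body></html>"
} > links.html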

Unsuccessful attempts so far:
- installation of htdig 3.1.5/3.1.6 (self-compiled, i.e. not from a package)
- db-directory on a ext2/ext3/reiser partition
- kernel 2.4.10 (Suse 7.3)
- kernel 2.4.21 (Suse 9)

I've read in
http://www.geocrawler.com/mail/msg.php3?msg_id=9056546&list=8822 that
ReiserFS could be an option, but it didn't work for me. Besides, I can
easily create files bigger than 2 GByte even on an ext2 partition (I
verified that with a shell script). Htdig 3.2.0b5 is not really an
option since digging is slower than 3.1.6 by a factor of ten (which would
mean a full ten days of indexing), despite the possible optimisations
described in the FAQ.
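
For completeness, the large-file check was nothing more than something like
this (the target directory is just an example):

# Create a ~3 GByte file on the same partition that holds the htdig db
# directory, then confirm its size is well past 2^31 - 1 bytes.
dd if=/dev/zero of=/data/htdig/bigfile.test bs=1M count=3000
ls -l /data/htdig/bigfile.test
rm /data/htdig/bigfile.test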

I know that the maximum file size can be a matter of architecture (I run
an x86 one), but also of the kernel (older kernels had this 2 GByte limit,
but I have a brand-new one?!?).

What puzzles me is that the author of the post linked above could overcome
his problem with a simple change of file system, but I can't...

Any help is really appreciated.

Thanks,

Anton


 

Bill Akins
SSS III
Emory Healthcare
(404) 712-2879 - Office
12674 - PIC
[EMAIL PROTECTED]
