Hello all!

When I run HtDig (version 3.1.6) with -i and -vvv, it runs for quite a
while (about 19 hours) and then dies with this:

title: doc8995.pdf
 size: 6536
pick: server.com.com, # servers = 1
452364:452364:2:http://server.com.com/docs/dir76/doc8995.pdf
Read 3957 from document
Read a total of 3957 bytes

title: doc8996.pdf
File size limit exceeded
[root@system Dir]#

The listing of the dir where dig data files reside:
-rw-r--r--    1 root    root    1369112576  Jun 23 11:05 db.docdb
-rw-r--r--    1 root    root    2147483647  Jun 23 11:05 db.wordlist
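
One thing that can be checked directly from that listing is whether either
file has reached 2^31 - 1 bytes, the largest size a signed 32-bit file
offset can hold. The Python sketch below does only that arithmetic on the
two sizes shown above. The helper name at_32bit_limit is made up for
illustration, and whether this limit is actually what stops the dig is an
assumption, not something the output above confirms.

```python
# Sizes in bytes, copied from the directory listing above.
LISTING = {
    "db.docdb": 1369112576,
    "db.wordlist": 2147483647,
}

# Largest value of a signed 32-bit integer (2 GiB minus 1 byte).
MAX_32BIT_FILE_SIZE = 2**31 - 1


def at_32bit_limit(size_bytes: int) -> bool:
    """Return True if a file size has reached the signed 32-bit ceiling.

    Hypothetical helper for illustration; not part of HtDig.
    """
    return size_bytes >= MAX_32BIT_FILE_SIZE


for name, size in LISTING.items():
    status = "at the 32-bit limit" if at_32bit_limit(size) else "below the limit"
    print(f"{name}: {size} bytes, {status}")
```

If a file sits exactly at that value, it is worth checking whether the
binaries and the filesystem were built with large file support.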

The significant portions of my .conf file:

wordlist_cache_size:     50000000
wordlist_compress:       false
allow_numbers:           true
valid_punctuation:       -/
max_head_length:         1500000
max_doc_size:            150000000
search_algorithm:        exact:1 synonyms:0.1 endings:0.1

I have all of the external parsers defined as well.  All was working
fine until last Tuesday, when we added 1,252 new PDF docs.  I then ran
a -a dig and found that a lot of docs were missing.  Here is the report
from the HtDig -a script (found in contrib/scripts on the HtDig web site):

rundig: Start time: Tue Jun 18 14:00:00 EDT 2002
rundig: Done Digging: Tue Jun 18 14:48:11 EDT 2002
htmerge: Total word count: 1511472
htmerge: Total documents: 17493
htmerge: Total size of documents (in K): 127347
rundig: Done Merging: Tue Jun 18 15:29:22 EDT 2002
rundig: End time: Tue Jun 18 15:36:25 EDT 2002

Output from the previous week:  (<< Note the differences! >>)
rundig: Start time: Tue Jun 11 16:48:21 EDT 2002
rundig: Done Digging: Tue Jun 11 17:19:12 EDT 2002
htmerge: Total word count: 1504940
htmerge: Total documents: 459082
htmerge: Total size of documents (in K): 2115373
rundig: Done Merging: Tue Jun 11 18:24:50 EDT 2002
rundig: End time: Tue Jun 11 18:36:47 EDT 2002
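
To put numbers on those differences, here is a small Python sketch that
pulls the three htmerge totals out of each report and prints the change
from one week to the next. The report text is copied from the two runs
above; the function names parse_totals and compare_totals are hypothetical
and exist only for this example.

```python
import re

REPORT_JUN_11 = """\
htmerge: Total word count: 1504940
htmerge: Total documents: 459082
htmerge: Total size of documents (in K): 2115373
"""

REPORT_JUN_18 = """\
htmerge: Total word count: 1511472
htmerge: Total documents: 17493
htmerge: Total size of documents (in K): 127347
"""


def parse_totals(report: str) -> dict:
    """Collect the 'htmerge: Total ...: N' lines into a {label: int} dict."""
    totals = {}
    for line in report.splitlines():
        match = re.match(r"htmerge: (Total .+): (\d+)$", line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def compare_totals(old: dict, new: dict) -> dict:
    """Return new minus old for every label present in both reports."""
    return {label: new[label] - old[label] for label in old if label in new}


old_totals = parse_totals(REPORT_JUN_11)
new_totals = parse_totals(REPORT_JUN_18)
changes = compare_totals(old_totals, new_totals)

for label, delta in changes.items():
    print(f"{label}: {old_totals[label]} -> {new_totals[label]} ({delta:+d})")
```

The word count barely moves between the two runs, while the document count
and the total document size both drop sharply.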

That is why I tried the -i in an alternate dir.  Does anything jump out
at you as being very, very wrong in either the conf file or the output?
The files are stored on a 120 GB RAID array, and the PDF files use only
about 15 GB of disk space.  In total, 460,996 PDF files in 89 dirs are
being indexed.  I have tested each of the new PDF files and know that
they are good; they were created in exactly the same manner as the other
400,000+ PDF files.  BTW, the file it died on, doc8996.pdf, is only
4096 bytes and has been indexed successfully in the past.

The system is Red Hat Linux 7.3 with all RH patches (except for upgrading
to HtDig 3.2), 1 GB RAM, more than 120 GB free disk space on a RAID 5
array, and an Apache 1.3.23-11 webserver using fancy indexing so the dig
can find the files; xpdf is ver. 1.00-3.  Sorry for the length of this
email; I just wanted to supply as much info as I could.  Thanks for any
input!



Bill Akins, CNE
Sr. OSA
Emory Healthcare
(404) 712-2879 - Office
12674 - PIC
[EMAIL PROTECTED]



