According to David Adams:
> This is a provisional report, but it very much looks as though a single web
> page consisting of a long list of numbers can set htdig 3.1.5 burning the
> CPU for a period of several hours.  My configuration file includes:
> 
> allow_numbers:    yes
> 
> Indexing the page
> http://www.maths.soton.ac.uk/postgraduate/students/Moxham/48.txt
> appears to have taken over 24 hours CPU time on a powerful SGI box, and htdig
> is now number crunching the next similar file,
> http://www.maths.soton.ac.uk/postgraduate/students/Moxham/spp2.txt
> 
> I am reporting this ASAP as it may account for some reports that htdig takes
> days to complete, when for me it normally indexes ~60,000 documents from
> scratch in less than 5 hours.

Right you are!  I was too late to grab the 48.txt file, but the spp2.txt
file was still there, and I can reproduce the problem in 3.1.5.  At least,
it's taking a huge amount of time (it's not done yet).  I think I'll
kill it and compile the code for profiling.  My guess is that all the
numbers hash poorly with the function used to build the in-memory word
list, so lookups are degenerating to a linear search.  Profiling may
confirm that or point elsewhere.
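
To illustrate the kind of degeneration I'm guessing at, here is a minimal
sketch.  It is NOT htdig's code: both hash functions, the bucket count and
the generated word list are assumptions made up for the illustration.  It
only shows how a weak hash can pile numeric "words" into a few buckets, so
that every lookup walks a long chain.

```python
# Toy illustration only: neither function is taken from htdig.

def additive_hash(word, buckets):
    # Hypothetical weak hash: sums character codes, so any two numbers
    # with the same digit sum land in the same bucket.
    return sum(ord(c) for c in word) % buckets

def polynomial_hash(word, buckets):
    # Hypothetical stronger hash: digit position matters as well.
    h = 0
    for c in word:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h % buckets

def longest_chain(words, hash_fn, buckets):
    # Count how many words fall into each bucket and return the worst
    # case, i.e. the longest list a lookup might have to scan.
    counts = [0] * buckets
    for w in words:
        counts[hash_fn(w, buckets)] += 1
    return max(counts)

# Stand-in for a page that is one long list of numbers.
numbers = [f"{n:06d}" for n in range(20000)]
buckets = 4096

weak = longest_chain(numbers, additive_hash, buckets)
better = longest_chain(numbers, polynomial_hash, buckets)
print("longest chain, additive hash:  ", weak)
print("longest chain, polynomial hash:", better)
```

With the additive hash, only a few dozen of the 4096 buckets are ever used,
so the table behaves much like a handful of linear lists.  Whether anything
similar happens in 3.1.5 is exactly what the profiling should tell us.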

I tried it with the 3.2.0b4-070801 snapshot, using a file:// URL, and
it took 8 minutes on an 866 MHz Pentium III, but the result was wrong.
It seems to have lost all the newlines from the file, so it tried to
index the whole file as one big number.  I'll need to look into whether
this is a problem with the HtFile handler or the Plaintext parser.  Once
I debug it, I'll try profiling it too, although the word indexing is
quite different in 3.2.
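
The effect of the lost newlines can be sketched as follows.  Again, this is
not htdig code: the tokenizer and the sample text are hypothetical, and the
sketch says nothing about where the newlines actually get dropped.  It only
shows why a list of numbers collapses into a single word once the line
breaks are gone.

```python
import re

def split_words(text):
    # Hypothetical tokenizer: every run of letters or digits is a word.
    return re.findall(r"[0-9A-Za-z]+", text)

# Stand-in for a plain-text file with one number per line.
sample = "1024\n2048\n4096\n"

# Newlines intact: each line is indexed as its own word.
intact = split_words(sample)

# Newlines dropped before word splitting: the digits fuse together.
fused = split_words(sample.replace("\n", ""))

print(intact)
print(fused)
```

Here `intact` holds three separate numbers, while `fused` holds one
12-digit number, which matches the symptom seen with the snapshot.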

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
