According to David Adams:
> This is a provisional report, but it very much looks as though a single web
> page consisting of a long list of numbers can set htdig 3.1.5 burning the
> CPU for a period of several hours. My configuration file includes:
>
> allow_numbers: yes
>
> Indexing the page
> http://www.maths.soton.ac.uk/postgraduate/students/Moxham/48.txt
> appears to have taken over 24 hours of CPU time on a powerful SGI box, and htdig
> is now number crunching the next similar file,
> http://www.maths.soton.ac.uk/postgraduate/students/Moxham/spp2.txt
>
> I am reporting this ASAP as it may account for some reports that htdig takes
> days to complete, when for me it normally indexes ~60,000 documents from
> scratch in less than 5 hours.

Right you are! I was too late to grab the 48.txt file, but the spp2.txt
file was still there, and I can reproduce the problem in 3.1.5. At least,
it's taking a huge amount of time (it hasn't finished yet). I think I'll
kill it and recompile the code for profiling. My guess is that all these
numbers don't work well with the hashing function used to build the
in-memory word list, so the lookup is degenerating into a linear search.
Profiling may confirm that or point elsewhere.
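
To illustrate the kind of degeneration I'm guessing at, here is a minimal
sketch. It does not use htdig's real hash function: weak_hash below is a
made-up, deliberately poor hash (a sum of character codes), and the table
size and test data are also assumptions. It only shows how a list of
similar-looking numbers can pile into a handful of buckets, so that each
lookup walks a long chain.

```python
# Hypothetical illustration only: weak_hash is NOT htdig's hash function.
def weak_hash(word, buckets):
    # Summing character codes ignores digit order, so "000123" and
    # "000321" (same digits, different order) always collide.
    return sum(ord(c) for c in word) % buckets


def chain_lengths(words, buckets, hash_fn):
    # Count how many words land in each bucket of a chained hash table.
    counts = [0] * buckets
    for w in words:
        counts[hash_fn(w, buckets)] += 1
    return counts


# Assumed test data: 100,000 zero-padded six-digit numbers.
numbers = [f"{n:06d}" for n in range(100000)]
buckets = 1024

counts = chain_lengths(numbers, buckets, weak_hash)
worst = max(counts)                      # longest chain a lookup may walk
used = sum(1 for c in counts if c > 0)   # buckets that hold anything
ideal = len(numbers) / buckets           # chain length if spread evenly

print(f"buckets used: {used} of {buckets}")
print(f"longest chain: {worst} (even spread would be about {ideal:.0f})")
```

With a hash like this, only a few dozen of the 1024 buckets are ever used,
so inserting each new word means scanning a chain thousands of entries
long. Whether 3.1.5 actually behaves this way is exactly what the
profiling should tell us.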

I tried it with the 3.2.0b4-070801 snapshot, using a file:// URL, and
it took 8 minutes on an 866 MHz Pentium III, but it messed up somehow.
It seems to have lost all the newlines from the file, so it tried to
index one big number. I'll need to look into whether this is a problem
with the HtFile handler or the Plaintext parser. Once I debug it, I'll
try profiling it too, although the word indexing is quite different in
3.2.
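
For what it's worth, here is a small sketch of why losing the newlines
matters. It is not htdig code: the tokenize function and the sample data
are assumptions, standing in for whatever the parser does when numbers are
allowed as words.

```python
import re


def tokenize(text):
    # Hypothetical stand-in for a word splitter that accepts numbers:
    # every maximal run of digits becomes one word.
    return re.findall(r"[0-9]+", text)


# Assumed sample: one number per line, as in a plain list of numbers.
original = "123\n456\n789\n"
stripped = original.replace("\n", "")  # what happens if newlines are lost

print(tokenize(original))  # three separate words
print(tokenize(stripped))  # a single, very long word
```

Once the separators are gone, the whole file collapses into a single
token, which would explain indexing one big number instead of many small
ones.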
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930