...that is, the 26521-node test case.

I've tried two things so far: changing the algorithm from
breadth-first to depth-first (major win), and removing some dicts from
the parsed form of the page after they've been used (minor win).  The
next thing to try is to get rid of the contents (but not the keys) of
the _failed hash in the Spider instance -- we never use them for
anything.  We could, of course, continue to trade time for space, by
eliminating the "_failed" table entirely, but that seems to be going
too far.
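To illustrate why the breadth-first to depth-first switch matters for memory, here is a minimal sketch in modern Python (not the Python 1.5.2 code in plucker-build). The function names `crawl` and `tree_links`, and the synthetic link tree itself, are assumptions made up for the example; the only point is how large the pending "to do" list gets under each strategy.

```python
from collections import deque


def crawl(start, links_of, depth_first):
    """Visit every node reachable from start.

    Returns (nodes visited, peak length of the pending list).
    """
    pending = deque([start])
    seen = {start}          # keys only: we just need membership tests
    peak = 1
    visited = 0
    while pending:
        # Depth-first takes the newest entry, breadth-first the oldest.
        node = pending.pop() if depth_first else pending.popleft()
        visited += 1
        for child in links_of(node):
            if child not in seen:
                seen.add(child)
                pending.append(child)
        peak = max(peak, len(pending))
    return visited, peak


def tree_links(node, fanout=3, max_depth=6):
    """Hypothetical link structure: a uniform tree, nodes are (depth, index)."""
    depth, index = node
    if depth == max_depth:
        return []
    return [(depth + 1, index * fanout + i) for i in range(fanout)]


bfs_visited, bfs_peak = crawl((0, 0), tree_links, depth_first=False)
dfs_visited, dfs_peak = crawl((0, 0), tree_links, depth_first=True)
print("breadth-first: visited", bfs_visited, "peak pending", bfs_peak)
print("depth-first:   visited", dfs_visited, "peak pending", dfs_peak)
```

Both orders visit the same nodes, but the breadth-first pending list grows to roughly the width of the widest level, while the depth-first one stays proportional to depth times fanout. The `seen` set also shows the keys-only idea: a table used purely for membership tests does not need to hold any values.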

The good news is that (for a machine with enough RAM) there doesn't
seem to be any appreciable slowdown as the number of nodes processed
keeps growing, so our data structures probably at least aren't
pathological.

Just for grins, here are the termination states before and after the
tweaks.  Using Python 1.5.2, things just don't work well when we go
over 2 GB of virtual memory :-).  In K, the magic number is 2097152 --
something to remember when looking at the VSZ in the following.  The
time reported is processor time on a 450 MHz UltraSparc II.
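As a quick check of that arithmetic, the following sketch (modern Python, with variable names chosen only for this example) converts 2 GB to K and compares it against the three VSZ figures reported in this message.

```python
# 2 GB expressed in K, the unit used for the VSZ and RSS columns.
LIMIT_KB = 2 * 1024 * 1024

# VSZ values (in K) copied from the three runs reported in this message.
runs = {
    "original": 2099568,
    "depth-first": 2097792,
    "clear_external_references": 2098304,
}

# How far past the 2 GB mark each process was when it died.
overshoot = {name: vsz - LIMIT_KB for name, vsz in runs.items()}

print("2 GB in K:", LIMIT_KB)
for name, extra in overshoot.items():
    print(f"{name}: {extra} K past the 2 GB mark")
```

Every run ended within a few thousand K of the limit, which fits the reading that the process failed on hitting the 2 GB ceiling rather than for some unrelated reason.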

Bill

Original version:  

---- 10068 collected, 410592 to do ----
Traceback (innermost last):
  File "/tilde/janssen/plucker/bin/plucker-build", line 1319, in ?
    sys.exit(realmain())
  File "/tilde/janssen/plucker/bin/plucker-build", line 1312, in realmain
    retval = main (config, exclusion_lists)
MemoryError
Wed Jun  5 17:49:54 PDT 2002

  PID  VSZ     RSS        TIME  
 9376 2099568 2099160   02:44:27

**************************************************************************

Changed from breadth-first to depth-first:

---- 21062 collected, 171 to do ----
Processing file:biam/Spe7814.html...
Traceback (innermost last):
  File "/pluckerhome/bin/plucker-build", line 1319, in ?
    sys.exit(realmain())
  File "/pluckerhome/bin/plucker-build", line 1312, in realmain
    retval = main (config, exclusion_lists)
  File "/pluckerhome/bin/plucker-build", line 855, in main
    spider.process_all(verbose=verbosity)
  File "/pluckerhome/bin/plucker-build", line 437, in process_all
    self.process (verbose)
  File "/pluckerhome/bin/plucker-build", line 510, in process
    message(3, "checking " + str(key))
  File "/pluckerhome/python/PyPlucker/UtilFns.py", line 54, in message
    actual_message = actual_message + '\n'
MemoryError
Wed Jun  5 21:24:37 PDT 2002

  PID  VSZ     RSS        TIME
13315 2097792 2097384   03:05:18

**************************************************************************

Added clear_external_references() to
PyPlucker.PluckerDocs.PluckerTextParagraph and
PyPlucker.PluckerDocs.PluckerTextDocument:

---- 22526 collected, 5754 to do ----
Processing file:biam/Spe29638.html...
  Retrieved ok.
Segmentation Fault
Thu Jun  6 01:50:04 PDT 2002

  PID  VSZ     RSS        TIME
18884 2098304 2097896   03:14:42
