-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
The debugging continues...
I managed to find, with the help of some python gurus, a way to
expose all the strings and objects that Spider.py uses during its
parse/gather/etc. runs.
In Spider.py, I imported 'gc', then later in the loop, I did the
following:
file("/tmp/pb.log", 'a').write('\n'.join(map(repr, gc.get_objects())))
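As an aside, the raw repr dump gets unwieldy fast. A per-type count of what gc is tracking is usually enough to spot what's piling up; here's a sketch of that variation (my own addition, written for a modern interpreter, not the Spider.py code):

```python
import gc
from collections import Counter

def summarize_gc(path):
    # Count live, gc-tracked objects by type instead of dumping a
    # repr of each one -- a few KB instead of a multi-GB log.
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    with open(path, 'a') as log:
        for name, n in counts.most_common(20):
            log.write("%8d %s\n" % (n, name))
    return counts

counts = summarize_gc("/tmp/pb-summary.log")
```

Calling this once per loop iteration, as above, gives a time series you can grep through afterwards.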
This creates a HUGE logfile for large parse runs like the one
I'm using now to test this. I think there's some object appending going on
here, where the array is appended to itself (pop(1), pop(1,2), pop(1,2,3),
and so on, with the previous contents tacked onto each new element). I
don't yet know where, but all the symptoms point at exactly that kind of
behavior: each additional page takes markedly longer to parse than the
previous one, which is the quadratic blowup you'd expect from that pattern.
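To make the suspected failure mode concrete, here's a toy illustration (hypothetical code, not from Spider.py) of how re-appending the accumulated contents on every pass turns linear work into quadratic:

```python
def gather(pages):
    # Suspected bug pattern: the running list is re-appended to the
    # output on every iteration, so page N costs N appends and the
    # total work grows as N*(N+1)/2 -- quadratic, not linear.
    seen = []
    out = []
    for page in pages:
        seen.append(page)
        out.extend(seen)      # previous contents tacked on again
    return out

print(len(gather(range(5))))  # 1+2+3+4+5 = 15
```

Each page individually parses fine; it's only the accumulated cost that makes the run crawl, which matches what I'm seeing.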
I also configured and installed oprofile (oprofile.sf.net), and
worked with the author on irc (isn't irc great? =) to get it hooked into a
debug build of python (distros do not ship debug builds of python by
default, so I had to build one). The results are quite interesting.
	I'm now keeping a running log of the parse on a remote server,
updated every 4 seconds or so, tail'ing the last 200 lines of a -V3
Spider.py run and the 'op_time -dnl dump' (oprofile), so you can see where
the system is spending the most time inside the python process.
	I configured, built, and installed python 2.2.1 with the following
options to aid in this debugging process:
export CFLAGS=-g
./configure --prefix=/usr --sysconfdir=/etc --with-pydebug \
--with-signal-module --with-threads --enable-ipv6 \
--with-cycle-gc --with-pymalloc
	This gives me a nice build which I can attach gdb to, trace, run
under valgrind (an x86 memory debugger, http://developer.kde.org/~sewardj/),
and profile with oprofile.
	For the curious, the running output of this absolutely huge build
test (26,516 files) is being dumped here. You'll need to shift-reload it
every few minutes; the log is self-overwriting and anywhere from 60-90k in
size, depending on when you happen to catch it:
http://66.93.78.136/plucker-build-biam.log
	The get_objects dump file is currently 1.5GB of open objects after
about 15 minutes of parsing, roughly 160 files grabbed so far. When (or if)
this parser run completes or fails, I'll poke through the dump and see if I
can find any reason why it takes so long and eats so much RAM. There's a
huge memory leak here, and I want to nail it.
d.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iD8DBQE8/AB0kRQERnB1rkoRAtLlAJ9apble+QufNH/Jj4DcxJCBJdGYlQCgrAVS
uabd74Pa5wXj8rT1XtFrk4g=
=mKnf
-----END PGP SIGNATURE-----