OK, this is a follow-up to a message from a few weeks ago. I hope I'm making this easy enough to duplicate.

I have a spider (attached below) that eats memory like crazy. After 61 fetched docs I see this kind of memory usage:

    >> +Fetched 61 http://www.microsoft.com/office/worldwide.htm 200 text/html 21140

    S USER     PID  PPID %CPU %MEM   RSS   VSZ COMMAND
    R lii     8356 12150 34.3 68.3 87336 88488 perl spider.pl

Just downgrading to URI 1.11, after 67 docs:

    >> +Fetched 67 http://www.microsoft.com/office/downloads/default.htm 200 text/html 33375

    S USER     PID  PPID %CPU %MEM  RSS  VSZ COMMAND
    R lii     8391 12150 10.8  5.4 7012 8184 perl spider.pl

That's a little better, no?

Tested with perl 5.6.1 and perl 5.6.0 and the modules listed below, on both Linux 2.2.13 and FreeBSD 2.2.8-STABLE. My perl -V is attached at the end. I built perl with:

    sh Configure -Dprefix=~/test_perl -de

If I downgrade from URI 1.15 to 1.11 the problem goes away. (The problem shows up when moving from 1.11 to 1.12.)

If the spider() sub (below) takes a scalar instead of a URI object, it doesn't eat memory. This is easy to test -- just call spider( $u->as_string ) wherever spider() is called, and change the spider sub to start like this:

    sub spider {
        my ( $url ) = @_;
        my $uri = URI->new( $url );

Then it no longer eats memory. That, to me, seems like a problem with perl, but it's also triggered by some change from URI 1.11 to 1.12.

Also, if I run it non-recursively it doesn't eat memory. That is, modify the code attached below to something like:

    my @uri = ( $uri );
    spider( \@uri ) while @uri;

    #----------- Process a url and recurse -----------------------
    sub spider {
        my ( $uriref ) = @_;
        my $uri = shift @$uriref;
        ...
        my $links = extract_links( \$content, $response );
        push @$uriref, @$links;
    }

That should be a workaround, but I've had a report from someone saying that it eats memory, too, under both FreeBSD and Mac OS X.
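In case it helps anyone trying this, here is that queue-based variant spelled out a bit more fully. This is only a sketch of how I'd wire it up -- it reuses the %visited hash, the $count variable, and the extract_links() sub from the attached script, and leaves out the redirect handling:

    my @queue = ( $uri );
    spider( \@queue ) while @queue;

    # Take one URL off the queue; push any links found instead of recursing
    sub spider {
        my ( $uriref ) = @_;

        my $uri = shift @$uriref;
        return if $visited{ $uri->canonical }++;

        my $ua = LWP::UserAgent->new;
        $ua->parse_head( 0 );              # Don't parse the content

        my $response = $ua->simple_request( HTTP::Request->new( 'GET', $uri ) );
        return unless $response->is_success;

        my $content = $response->content
            or return;

        # Same link extraction as in the attached script, but queue the
        # links rather than calling spider() on each one
        my $links = extract_links( \$content, $response );
        push @$uriref, @$links;
    }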
This problem happens with a new install of perl 5.6.0 (or 5.6.1) with *only* the following modules installed:

    URI-1.15.tar.gz
    MIME-Base64-2.12.tar.gz
    HTML-Tagset-3.03.tar.gz
    HTML-Parser-3.25.tar.gz
    libnet-1.0703.tar.gz
    Digest-MD5-2.14.tar.gz
    libwww-perl-5.53.tar.gz

Below is a trimmed-down version of the original code, but it should demonstrate the problem.

    #!/usr/local/bin/perl -w
    use strict;
    use URI;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use HTML::Tagset;

    my %visited;   # track which URLs were seen

    my $server = {
        base_url  => 'http://www.microsoft.com/',
        link_tags => [qw/ a frame /],
    };

    my $uri = URI->new( $server->{base_url} );
    $server->{same} = [ $uri->authority ];

    my $count = 0;

    spider( $uri );

    #----------- Process a url and recurse -----------------------
    sub spider {
        my ( $uri ) = @_;

        return if $visited{ $uri->canonical }++;

        my $ua = LWP::UserAgent->new;
        $ua->parse_head( 0 );              # Don't parse the content

        my $request  = HTTP::Request->new( 'GET', $uri );
        my $response = $ua->simple_request( $request );
        my $content  = $response->content;

        # Log the response
        print STDERR '>> ',
            join( ' ',
                ( $response->is_success ? '+Fetched' : '-Failed' ),
                ++$count,
                $response->request->uri->canonical,
                $response->code,
                $response->content_type,
                $response->content_length,
            ), "\n";

        unless ( $response->is_success ) {
            # look for a redirect
            if ( $response->is_redirect && $response->header('location') ) {
                my $u = URI->new_abs( $response->header('location'), $response->base );
                spider( $u );
            }
            return;
        }

        return unless $content;

        my $links = extract_links( \$content, $response );

        # Now spider the links
        spider( $_ ) for @$links;
    }

    #----------- Extract same-site links from a page -------------
    sub extract_links {
        my ( $content, $response ) = @_;

        return [] unless $response->header('content-type')
                      && $response->header('content-type') =~ m[^text/html];

        my @links;
        my $base = $response->base;

        my $p = HTML::LinkExtor->new;
        $p->parse( $$content );

        for ( $p->links ) {
            my ( $tag, %attr ) = @$_;
            next unless $tag eq 'a';

            # which attributes of this tag can hold links
            my $links = $HTML::Tagset::linkElements{$tag};
            $links = [ $links ] unless ref $links;

            for ( @$links ) {
                if ( $attr{ $_ } ) {   # tag has this link attribute
                    my $u = URI->new_abs( $attr{$_}, $base );
                    $u->fragment( undef );

                    next unless $u->scheme =~ /^http$/;
                    next unless $u->host;
                    next unless grep { $u->authority eq $_ } @{ $server->{same} };

                    push @links, $u;
                }
            }
        }
        return \@links;
    }

    ~/test_perl/bin/perl -V
    Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
      Platform:
        osname=linux, osvers=2.2.13, archname=i686-linux
        uname='linux mardy 2.2.13 #1 mon nov 8 15:51:29 cet 1999 i686 unknown '
        config_args='-Dprefix=~/test_perl -de'
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
        useperlio=undef d_sfio=undef uselargefiles=define use64bitint=undef
        use64bitall=undef uselongdouble=undef usesocks=undef
      Compiler:
        cc='cc', optimize='-O2',
        gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
        cppflags='-fno-strict-aliasing -I/usr/local/include'
        ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
        stdchar='char', d_stdstdio=define, usevfork=false
        intsize=4, longsize=4, ptrsize=4, doublesize=8
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=4, usemymalloc=n, prototype=define
      Linker and Libraries:
        ld='cc', ldflags =' -L/usr/local/lib'
        libpth=/usr/local/lib /lib /usr/lib
        libs=-lnsl -lndbm -lgdbm -ldbm -ldb -ldl -lm -lc -lposix -lcrypt
        libc=, so=so, useshrplib=false, libperl=libperl.a
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
        cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

    Characteristics of this binary (from libperl):
      Compile-time options: USE_LARGE_FILES
      Built under linux
      Compiled at Jul 20 2001 19:54:20

Bill Moseley
mailto:[EMAIL PROTECTED]
