OK, this is a follow-up to a message from a few weeks ago. I hope I'm making this easy enough to duplicate.

I have a spider (attached below) that eats memory like crazy. After 61 fetched docs I see this kind of memory usage:

    >> +Fetched 61 http://www.microsoft.com/office/worldwide.htm 200 text/html 21140

    S USER     PID  PPID %CPU %MEM   RSS   VSZ COMMAND
    R lii     8356 12150 34.3 68.3 87336 88488 perl spider.pl

Just downgrading to URI 1.11, after 67 docs:

    >> +Fetched 67 http://www.microsoft.com/office/downloads/default.htm 200 text/html 33375

    S USER     PID  PPID %CPU %MEM  RSS  VSZ COMMAND
    R lii     8391 12150 10.8  5.4 7012 8184 perl spider.pl

That's a little better, no?

Tested with perl 5.6.1 and perl 5.6.0 and the modules listed below, on both Linux 2.2.13 and FreeBSD 2.2.8-STABLE. My perl -V is attached at the end. I built perl with:

    sh Configure -Dprefix=~/test_perl -de

If I downgrade from URI 1.15 to 1.11 the problem goes away. (The problem shows up when moving from 1.11 to 1.12.)

If the spider() sub (below) takes a scalar instead of a URI object, it doesn't eat memory. This is easy to test -- just call spider( $u->as_string ) wherever spider() is called, and change the spider sub to start like this:

    sub spider {
        my ( $url ) = @_;
        my $uri = URI->new( $url );

Then it no longer eats memory. That, to me, seems like a problem with perl, but it's also triggered by some change from URI 1.11 to 1.12.

Also, if I run it non-recursively it doesn't eat memory. That is, modify the code attached below to something like:

    my @uri = ( $uri );
    spider( \@uri ) while @uri;

    #----------- Process a url and recurse -----------------------
    sub spider {
        my ( $uriref ) = @_;
        my $uri = shift @$uriref;
        ...
        my $links = extract_links( \$content, $response );
        push @$uriref, @$links;
    }

That should be a workaround, but I've had a report from someone saying that it eats memory, too, under both FreeBSD and Mac OS X.
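In case it helps anyone trying this, here is that queue-based variant spelled out a bit more fully. This is only a sketch of how I'd wire it up -- it reuses the %visited hash, the $count variable, and the extract_links() sub from the attached script, and leaves out the redirect handling:

    my @queue = ( $uri );
    spider( \@queue ) while @queue;

    # Take one URL off the queue; push any links found instead of recursing
    sub spider {
        my ( $uriref ) = @_;

        my $uri = shift @$uriref;
        return if $visited{ $uri->canonical }++;

        my $ua = LWP::UserAgent->new;
        $ua->parse_head( 0 );              # Don't parse the content

        my $response = $ua->simple_request( HTTP::Request->new( 'GET', $uri ) );
        return unless $response->is_success;

        my $content = $response->content
            or return;

        # Same link extraction as in the attached script, but queue the
        # links rather than calling spider() on each one
        my $links = extract_links( \$content, $response );
        push @$uriref, @$links;
    }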
This problem happens with a new install of perl 5.6.0 (or 5.6.1) with *only* the following modules installed:

    URI-1.15.tar.gz
    MIME-Base64-2.12.tar.gz
    HTML-Tagset-3.03.tar.gz
    HTML-Parser-3.25.tar.gz
    libnet-1.0703.tar.gz
    Digest-MD5-2.14.tar.gz
    libwww-perl-5.53.tar.gz

Below is a trimmed-down version of the original code, but it should demonstrate the problem.

    #!/usr/local/bin/perl -w
    use strict;
    use URI;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use HTML::Tagset;

    my %visited;   # track which URLs were seen

    my $server = {
        base_url  => 'http://www.microsoft.com/',
        link_tags => [qw/ a frame /],
    };

    my $uri = URI->new( $server->{base_url} );
    $server->{same} = [ $uri->authority ];

    my $count = 0;

    spider( $uri );

    #----------- Process a url and recurse -----------------------
    sub spider {
        my ( $uri ) = @_;

        return if $visited{ $uri->canonical }++;

        my $ua = LWP::UserAgent->new;
        $ua->parse_head( 0 );              # Don't parse the content

        my $request  = HTTP::Request->new( 'GET', $uri );
        my $response = $ua->simple_request( $request );
        my $content  = $response->content;

        # Log the response
        print STDERR '>> ',
            join( ' ',
                ( $response->is_success ? '+Fetched' : '-Failed' ),
                ++$count,
                $response->request->uri->canonical,
                $response->code,
                $response->content_type,
                $response->content_length,
            ), "\n";

        unless ( $response->is_success ) {
            # look for a redirect
            if ( $response->is_redirect && $response->header('location') ) {
                my $u = URI->new_abs( $response->header('location'), $response->base );
                spider( $u );
            }
            return;
        }

        return unless $content;

        my $links = extract_links( \$content, $response );

        # Now spider the links
        spider( $_ ) for @$links;
    }

    #----------- Extract same-site links from a page -------------
    sub extract_links {
        my ( $content, $response ) = @_;

        return [] unless $response->header('content-type')
                      && $response->header('content-type') =~ m[^text/html];

        my @links;
        my $base = $response->base;

        my $p = HTML::LinkExtor->new;
        $p->parse( $$content );

        for ( $p->links ) {
            my ( $tag, %attr ) = @$_;
            next unless $tag eq 'a';

            # which attributes of this tag can hold links
            my $links = $HTML::Tagset::linkElements{$tag};
            $links = [ $links ] unless ref $links;

            for ( @$links ) {
                if ( $attr{ $_ } ) {   # tag has this link attribute
                    my $u = URI->new_abs( $attr{$_}, $base );
                    $u->fragment( undef );

                    next unless $u->scheme =~ /^http$/;
                    next unless $u->host;
                    next unless grep { $u->authority eq $_ } @{ $server->{same} };

                    push @links, $u;
                }
            }
        }
        return \@links;
    }

    ~/test_perl/bin/perl -V
    Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
      Platform:
        osname=linux, osvers=2.2.13, archname=i686-linux
        uname='linux mardy 2.2.13 #1 mon nov 8 15:51:29 cet 1999 i686 unknown '
        config_args='-Dprefix=~/test_perl -de'
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
        useperlio=undef d_sfio=undef uselargefiles=define use64bitint=undef
        use64bitall=undef uselongdouble=undef usesocks=undef
      Compiler:
        cc='cc', optimize='-O2',
        gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
        cppflags='-fno-strict-aliasing -I/usr/local/include'
        ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
        stdchar='char', d_stdstdio=define, usevfork=false
        intsize=4, longsize=4, ptrsize=4, doublesize=8
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=4, usemymalloc=n, prototype=define
      Linker and Libraries:
        ld='cc', ldflags =' -L/usr/local/lib'
        libpth=/usr/local/lib /lib /usr/lib
        libs=-lnsl -lndbm -lgdbm -ldbm -ldb -ldl -lm -lc -lposix -lcrypt
        libc=, so=so, useshrplib=false, libperl=libperl.a
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
        cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

    Characteristics of this binary (from libperl):
      Compile-time options: USE_LARGE_FILES
      Built under linux
      Compiled at Jul 20 2001 19:54:20

Bill Moseley
mailto:[EMAIL PROTECTED]
