Dear all,

I am not entirely sure that this is a Mac OS X/Perl question, but I'll ask anyway, since the problem showed up on a computer running Jaguar and is probably related to a Perl script. If you think it's not relevant, please ignore it.

So, I have been running a script that takes a list of URLs as input and dumps all the text in the corresponding pages to a file (I am a linguist and I am interested in downloading a largish corpus of internet text for research purposes).

This is the core of the script:

***************************************************************************************************

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;
use HTML::Parse;
use HTML::FormatText;
use Sys::AlarmCall;

while (<>) {

    # ...

    my ($url) = $_;
    chomp $url;
    # Skip URLs that end in a non-HTML file extension.
    if ($url !~ /\.(ps|gz|pdf|gif|jpg|jpeg|doc|xls|ppt|rtf)$/i) {
        # Fetch the page, giving up after 60 seconds.
        my ($html_text) = alarm_call(60,'get',$url);
        if ($html_text && ($html_text ne "TIMEOUT")) {
            my ($text) = HTML::FormatText->new->format(parse_html($html_text));
            # Only print pages that contain at least one letter.
            if ($text =~ /[a-zA-Z]/) {
                print "CURRENT URL $url\n";
                # ... (some text processing)
                print "$text\n";
            }
        }
    }
}


******************************************************
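
The extension test in the loop is meant to skip URLs that end in one of the listed suffixes. Here is a minimal standalone sketch of that check, with the alternatives grouped so that the leading dot and the end-of-string anchor apply to every one of them. The helper name looks_like_text_url and the example.com URLs are hypothetical and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns true if the
# URL does NOT end in one of the file extensions the script wants to skip.
sub looks_like_text_url {
    my ($url) = @_;
    # The group (?:...) makes "\." and "$" apply to every alternative.
    return $url !~ /\.(?:ps|gz|pdf|gif|jpe?g|doc|xls|ppt|rtf)$/i;
}

# Made-up example URLs, only to show which ones would be fetched.
for my $url ('http://example.com/index.html',
             'http://example.com/paper.PDF',
             'http://example.com/gifts/list.html') {
    printf "%-40s %s\n", $url, looks_like_text_url($url) ? 'fetch' : 'skip';
}
```

Without the grouping, a pattern such as /\.(ps)|(gz)|(gif)$/ is read as "\.ps anywhere, or gz anywhere, or gif at the end", so a URL that merely contains one of the middle alternatives (for example "gifts" against a bare "gif" alternative) would be skipped as well.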

I started the script yesterday afternoon, it ran through the night, and this morning when I woke up it was done. It had successfully processed the whole input list, which contained 17834 URLs, except for the last few addresses, which were all https addresses. HOWEVER, the computer (which runs 10.2) was very, very slow: it took forever to run even the most basic operations, such as starting a text editor or doing an ls in an already open terminal window.

Now: the script had finished, and it looked like it had done its job successfully (OK, I had no serious error handling...). I ran top, and I could not see any suspicious-looking process. Still, the computer was very, very slow. I had to reboot. That also took a while, but now the machine is working fine again.

Since the only thing that happened on the computer between yesterday afternoon, when it was working fine, and this morning, when it got so slow, was the script, I suspect that the problem has something to do with the script. But I don't understand why such a problem would persist even after the script was done, leaving no trace in top...

Speaking of top, another thing I noticed last night, after the script had been running for a few hours, was that it was taking up a huge amount of memory, more than 500M of RSIZE, and this size seemed to be constantly increasing. This surprised me, since the script is not doing anything that, in my naive view, would require progressively larger chunks of memory...

I would be very grateful if somebody could tell me whether the script could have been the cause of the slowdown and, if so, what made the problem persist after the script was done, and perhaps also the probable cause of the steadily growing memory usage.

Thanks a lot!

Ciao,

Marco



