Re: how to optimize it?

Rob Dixon Sun, 30 Mar 2003 16:38:50 -0800

Adriano Allora wrote:
> hello,
> I wrote this script, and it works: it clean all the files in a 157Mb
> directory in 6 minutes.
> But I recently used it in a directory in which I stored only one 145 Mb
> file and it is very very very slow (I suppose because it isn't
> optimized: it uses memory not very well).
> Someone may help me to make this script faster?


Hi Adriano.

6 minutes sounds a long time for just 157Mb of data. I suspect the main reason
it's so slow is that it's building the new file in a single scalar variable
before dumping it to disk. There are a number of things I want to point out
about your script before I show you a faster version.

>
>
> ~~~~~~~~~~~~~~~~~~~THE SCRIPT~~~~~~~~~~~~~~~~~~~~~~~~~
>
> #!/usr/bin/perl -w
>
> use strict;

Good!

> my $testo;
> my $var = 1;
>
> @ARGV = </Users/pes/Desktop/Testo140M/*.txt>;
> while(<>){

This is exactly the right way to use @ARGV and <>. Well done.

>          tr/\015\012/\n/s;
>          tr/"*^_\-+' //s;
>      $_ = '' if /^(?:Newsgroups: it.|Subject: |Date: |Message-ID: |References: |From: |Message-ID: |References: )/;

This statement is a little messy. I would choose to declare the regex
in a separate scalar variable (compiled with qr//), and just do 'next'
if it matched rather than go to the trouble of appending an empty string
onto the $testo string. Like this:

    next if /$header/;

In my final version I've left it in-line (so as not to change your source beyond
recognition) but I've used the /x modifier so that I could use whitespace to
lay it out better. That meant having to use \s for each hard space that I wanted.
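
The separate-variable approach could look something like the sketch below. The name $header, the sample lines and the escaped dot after "it" are my own assumptions for illustration; the header prefixes are the ones from the script.

```perl
#!/usr/bin/perl -w
use strict;

# Compile the header pattern once; /x lets us lay it out with whitespace,
# so each literal space has to be written as \s.
my $header = qr/^(?:
        Newsgroups:\s+it\. |
        Subject:\s         |
        Date:\s            |
        Message-ID:\s      |
        References:\s      |
        From:\s
)/x;

# Made-up sample records standing in for lines read with <>.
my @lines = (
    "Subject: test\n",
    "From: someone\n",
    "Body text stays\n",
);

my @kept;
for (@lines) {
    next if /$header/;    # skip header lines entirely
    push @kept, $_;
}

print @kept;
```

Only the body line is kept; the two header lines are skipped before any other processing happens.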

>          s/(\w\')/$1 /g;

Insert a space after every single quote that follows a word character?
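
A quick way to check what that substitution does is to run it on a short string. This is only a sketch; the sample text is made up.

```perl
use strict;
use warnings;

# Made-up sample text; the substitution is the one from the script.
my $line = "l'altro giorno dell'anno";

# Keep the word character and the quote ($1) and add a space after them.
(my $spaced = $line) =~ s/(\w\')/$1 /g;

print "$spaced\n";    # l' altro giorno dell' anno
```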

>          $testo .=$_;
>          if ( eof ) {
> # I don't understand why the three next steps does not work before the EOF

Neither do I, and it's a little worrying if it's true. This could only happen if
the matched string spread across two or more records of the input file.
You need to get this fixed or you're stuck with reading the entire file
into a string, and so we can't speed it up!

>      $testo =~ s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}\S*! URL!g;

That looks like it should replace something like a URL with ' URL'. I don't think
you need to be so particular about the number of characters in each group:

    http://id.domain.net/

wouldn't match, for instance, as it has only two characters in the server name.
But that's not a problem for the minute so I've left it as it is.
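
If you want to see which strings the pattern accepts, a small test like this sketch will tell you. The www.example.com address is a made-up sample; the other one is the address mentioned above.

```perl
use strict;
use warnings;

# The URL pattern from the script, compiled once.
my $url = qr!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}\S*!;

my @samples = (
    'http://www.example.com/page',    # made-up sample with a 3-letter host
    'http://id.domain.net/',          # only two characters before the first dot
);

my %matched;
for my $s (@samples) {
    $matched{$s} = ($s =~ $url) ? 1 : 0;
    print "$s: ", ($matched{$s} ? "matches" : "no match"), "\n";
}
```

The first sample matches and the second does not, because \w{3,} needs at least three word characters before the first dot.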

>          $testo =~ tr/\n\r\f\t /\n\r\f\t /s;

That will just change strings of identical consecutive characters into a single
copy of that character, if the character is one of the 5 whitespace characters
listed. It won't, for instance, change "tab, space, tab, space, tab, space" at the
start of the line at all. I suspect that what you meant was:

    s/\s+/ /g;

but, again, I've left it as it is.
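
To make the difference concrete, here is a sketch that runs the squeezing tr/// from the script next to a substitution that collapses any run of whitespace into one space. The input string is made up.

```perl
use strict;
use warnings;

# Made-up input: alternating tabs and spaces, then a run of three spaces.
my $line = "\t \t \tsome   text";

# Squeeze only: runs of the SAME character become one copy.
(my $squeezed = $line) =~ tr/\n\r\f\t /\n\r\f\t /s;

# Collapse: any run of mixed whitespace becomes a single space.
(my $collapsed = $line) =~ s/\s+/ /g;

# Make tabs visible before printing.
for my $result ($squeezed, $collapsed) {
    (my $visible = $result) =~ s/\t/\\t/g;
    print "[$visible]\n";
}
```

The tr/// version leaves the leading tab/space sequence alone and only shortens the three spaces; the s/// version reduces the leading sequence to one space as well.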

>          $testo =~ s/ \n//g;

Even more puzzling! If a line ends with a space followed by its terminating
newline then both of these characters will be removed. It won't remove
more than one space and it will delete the newline as well so that the next
line joins onto this one. It looks like you're trying to remove trailing spaces
on a line, in which case you want:

    s/[ \t]+$//;
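
Here is a sketch of the difference, using made-up lines. Note that \s also matches the newline, so a pattern built on \s+ strips the line terminator too; a character class of space and tab leaves it alone.

```perl
use strict;
use warnings;

# Whole-file string, as in the original script: one space plus the
# newline are removed, so the two lines are joined.
my $text = "first line  \nsecond line\n";
(my $joined = $text) =~ s/ \n//g;

# Line at a time: strip trailing spaces and tabs, keep the newline.
my @lines   = ("first line  \n", "second line\n");
my @trimmed = map { (my $l = $_) =~ s/[ \t]+$//; $l } @lines;

# Line at a time with \s+: the newline goes as well.
(my $chopped = $lines[0]) =~ s/\s+$//;

print $joined;
print @trimmed;
print "[$chopped]\n";
```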

There are several places in your program where you are deleting \n characters,
which will have the effect of joining consecutive lines in the output. Because
this may be what you want I'm keeping quiet for now, but because it probably
isn't what you want I'm pointing it out for you to check for yourself!

>          close $ARGV;

This will fail. $ARGV is the name of the file currently open, and 'close'
must be called on a filehandle. The file is open on filehandle ARGV, and
closing that at eof is harmless: <> simply moves on to the next file.

>          open ATTUALE, ">$ARGV" or die "Non posso aprire $ARGV perche' $!\n";

I'm surprised this works at all. You're trying to open for write a file that you
already have open for read. It's quite possible that this is OK if the read is at
eof, or perhaps Perl opens the file in a way that allows concurrent writes (which
doesn't sound sensible).

>          print ATTUALE $testo;
>          close ATTUALE;
>          print "$var - I cleaned $ARGV\n";
>          $testo = "";
>          $var++;
>          }}
>
> print "DONE.\n";

All this is fine. Now to my version, which is largely the same as yours, but
I'm using the in-place edit option by setting the Perl variable $^I to an empty
string. Defining it at all enables the in-place edit; because the string is
empty no backup copy is kept, so the old file is overwritten with the new data,
just as your program did.

Beware that this also sets the default output file to the new copy of the
file. If you want to print diagnostics you have to write

    print STDOUT "Debug Text\n";

or similar.

Note that the files are being read and written one line at a time, which
will speed things up for you a lot. My guess is that you have relatively
little memory on your machine (say 32Mb?), and to fit a whole 150Mb
file into there the system had to make a lot of use of the swap file.

See how you get on with this, but beware that your original files
are getting overwritten each time. You should be used to this though
as that's what happened with your old code. (Changing $^I to '.bak'
instead will rename the old file to 'file.bak' before the new file
is written, if that's any help.)
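
If you'd like to try the in-place mechanism safely first, something like this sketch works on a throwaway file. The temporary directory, the file name sample.txt and its contents are assumptions for the demonstration; $^I, @ARGV, <> and eof behave as described above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory and file so no real data is touched.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.txt";

open my $out, '>', $file or die "Cannot create $file: $!\n";
print $out "Subject: hello\nkeep  this\n";
close $out;

{
    local $^I   = '.bak';      # in-place edit, original kept as sample.txt.bak
    local @ARGV = ($file);

    while (<>) {
        next if /^Subject:\s/; # dropped lines never reach the new file
        tr/ //s;               # squeeze runs of spaces
        print;                 # goes to the rewritten file, not the screen
        print STDOUT "cleaned $ARGV\n" if eof;
    }
}

# Read the rewritten file back to check the result.
open my $in, '<', $file or die "Cannot read $file: $!\n";
my $result = do { local $/; <$in> };
close $in;

print STDOUT $result;
```

The rewritten file holds just "keep this", and the untouched original sits next to it with the .bak suffix.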

Let us know how you get on.

Cheers,

Rob


    #!/usr/bin/perl -w

    use strict;

    my $var;

    # Enable in-place editing with no backup copy
    $^I = '';

    @ARGV = </Users/pes/Desktop/Testo140M/*.txt>;

    while (<>) {

        # Skip unwanted header lines
        next if /^(?:
                Newsgroups:\s+it\.|
                Subject:\s|
                Date:\s|
                Message-ID:\s|
                References:\s|
                From:\s
        )/x;

        tr/\015\012/\n/s;
        tr/"*^_\-+' //s;
        s/(\w\')/$1 /g;
        s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}\S*! URL!g;
        tr/\n\r\f\t /\n\r\f\t /s;
        s/ \n//g;

        # With $^I set, a plain print writes to the new copy of the file
        print;

        # Report on the terminal once each file is finished
        print STDOUT ++$var, " - I cleaned $ARGV\n" if eof;
    }



