Adriano Allora wrote:
> hello,
> I wrote this script, and it works: it clean all the files in a 157Mb
> directory in 6 minutes.
> But I recently used it in a directory in which I stored only one 145 Mb
> file and it is very very very slow (I suppose because it isn't
> optimized: it uses memory not very well).
> Someone may help me to make this script faster?

Hi Adriano.

6 minutes sounds like a long time for just 157Mb of data. I suspect the
main reason it's so slow is that it builds the new file in a single
scalar variable before dumping it to disk. There are a number of things I
want to point out about your script before I show you a faster version.

> ~~~~~~~~~~~~~~~~~~~THE SCRIPT~~~~~~~~~~~~~~~~~~~~~~~~~
>
> #!/usr/bin/perl -w
>
> use strict;

Good!

> my $testo;
> my $var = 1;
>
> @ARGV = </Users/pes/Desktop/Testo140M/*.txt>;
> while(<>){

This is exactly the right way to use @ARGV and <>. Well done.

>     tr/\015\012/\n/s;
>     tr/"*^_\-+' //s;
>     $_ = '' if /^(?:Newsgroups: it.|Subject: |Date: |Message-ID:
> References: |
> From: |Message-ID: |References: )/;

This statement is a little messy. I would declare the regex in a separate
scalar variable and just do 'next' if it matches, rather than go to the
trouble of appending a null string onto the $testo string. Like this:

    next if /$header/;

In my final version I've left it in-line (so as not to change your source
beyond recognition) but I've used the /x modifier so that I could use
whitespace to lay it out better. That meant having to use \s for each
hard space that I wanted.

>     s/(\w\')/$1 /g;

Add a space after every single quote that follows a word character?

>     $testo .=$_;
>     if ( eof ) {
>         # I don't understand why the three next steps does not work before the EOF

Neither do I, and it's a little worrying if it's true. This could only
happen if the matched string were spread across two or more records of
the input file. You need to get this fixed or you're stuck with reading
the entire file into a string, and then we can't speed it up!

>         $testo =~ s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}\S*! URL!g;

That looks like it should replace something like a URL with ' URL'. I
don't think you need to be so particular about the number of characters
in each group: http://id.domain.net/ wouldn't match, for instance, as it
has only two characters in the server name.
But that's not a problem for the minute so I've left it as it is.

>         $testo =~ tr/\n\r\f\t /\n\r\f\t /s;

That will just change runs of identical consecutive characters into a
single copy of that character, if the character is one of the five
whitespace characters listed. It won't, for instance, change "tab, space,
tab, space, tab, space" at the start of a line at all. I suspect that
what you meant was:

    s/\s+/ /g;

but, again, I've left it as it is.

>         $testo =~ s/ \n//g;

Even more puzzling! If a line ends with a space followed by its
terminating newline then both of these characters will be removed. It
won't remove more than one space, and it will delete the newline as well,
so that the next line joins onto this one. It looks like you're trying to
remove trailing spaces on a line, in which case you want:

    s/\s+$//;

(Bear in mind that \s matches the newline too, so chomp the line first
and put the newline back when you print.)

There are several places in your program where you are deleting \n
characters, which will have the effect of joining consecutive lines in
the output. Because this may be what you want I'm keeping quiet for now,
but because it probably isn't what you want I'm pointing it out for you
to check for yourself!

>         close $ARGV;

This will fail. $ARGV is the name of the file currently open, and 'close'
must be called on a filehandle. The file is open on filehandle ARGV, but
it will probably stop you from reading any more files if you close it
anyway.

>         open ATTUALE, ">$ARGV" or die "Non posso aprire $ARGV perche' $!\n";

I'm surprised this works at all. You're trying to open for write a file
that you already have open for read. It's quite possible that this is OK
if the read is at eof, or perhaps Perl opens the file allowing for
concurrent writes (which doesn't sound sensible).

>         print ATTUALE $testo;
>         close ATTUALE;
>         print "$var - I cleaned $ARGV\n";
>         $testo = "";
>         $var++;
>     }}
>
> print "DONE.\n";

All this is fine.

Now to my version, which is largely the same as yours, but I'm using the
in-place edit option by setting the Perl variable $^I to a null string.
This is sufficient to define it, which enables the in-place edit but
overwrites the old file with the new data, as your program did. Beware
that this also sets the default output filehandle to the new copy of the
file. If you want to print diagnostics you have to write

    print STDOUT "Debug Text\n";

or similar.

Note that the files are being read and written one line at a time, which
will speed things up for you a lot. My guess is that you have relatively
little memory on your machine (say 32Mb?), and to fit a whole 150Mb file
in there the system had to make heavy use of the swap file.

See how you get on with this, but beware that your original files are
overwritten each time. You should be used to this, though, as that's what
happened with your old code. (Changing $^I to '.bak' instead will rename
the old file to 'file.bak' before the new file is written, if that's any
help.)

Let us know how you get on.

Cheers,

Rob


#!/usr/bin/perl -w
use strict;

my $var;
$^I = '';    # in-place edit with no backup copy

@ARGV = </Users/pes/Desktop/Testo140M/*.txt>;

while (<>) {

    # Skip header lines (/x lets the pattern be laid out over several lines)
    next if /^(?:
        Newsgroups:\s+it.|
        Subject:\s|
        Date:\s|
        Message-ID:\s|
        References:\s|
        From:\s|
        Message-ID:\s|
        References:\s
    )/x;

    tr/\015\012/\n/s;
    tr/"*^_\-+' //s;
    s/(\w\')/$1 /g;

    s!(?:http://)?\w{3,}(?:\.(?:\w-?)+)+\.\w{2,3}\S*! URL!g;
    tr/\n\r\f\t /\n\r\f\t /s;
    s/ \n//g;

    print;    # goes to the new copy of the current file

    print STDOUT ++$var, " - I cleaned $ARGV\n" if eof;
}
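
If you would rather try the in-place edit somewhere harmless before
letting it loose on your real files, something like the following might
do. It is only a minimal sketch: the temporary directory, the file name
sample.txt and its two lines of content are made up for the
demonstration, and the loop body is cut down to one header test and one
tr. The parts that matter are $^I, @ARGV and <>, with '.bak' used so that
the original file survives.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory and file (hypothetical names, for the demo only).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.txt";

open my $out, '>', $file or die "Cannot create $file: $!\n";
print $out "Subject: test\n", "hello   world\n";
close $out or die "Cannot close $file: $!\n";

{
    # '.bak' keeps the original as sample.txt.bak; '' would overwrite it.
    # 'local' restores $^I and @ARGV when this block ends.
    local ($^I, @ARGV) = ('.bak', $file);

    while (<>) {
        next if /^Subject:\s/;    # drop a header line
        tr/ //s;                  # squeeze runs of spaces to a single space
        print;                    # written to the new copy of the file
        print STDOUT "cleaned $ARGV\n" if eof;    # diagnostics need STDOUT
    }
}
```

Afterwards sample.txt should hold only the cleaned line, and
sample.txt.bak the untouched original, so you can compare the two before
trusting the same loop with $^I = '' on the real directory.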