On Jun 19, 2006, at 22:45, Anthony Ettinger wrote:

   # order matters
   $raw_text =~ s/\015\012/\n/g;
   $raw_text =~ s/\012/\n/g unless "\n" eq "\012";
   $raw_text =~ s/\015/\n/g unless "\n" eq "\015";


Does it make any difference if I use s/\cM\cJ/\cJ/g vs. s/\015\012/\n/g?

The regexp is OK, the replacement string is not, because \cJ is not necessarily eq "\n". The latter is portable, the former is not.

Since the newline convention is not necessarily the one in the
runtime platform you cannot write a line-oriented script. If files
are too big to slurp then you'd work on chunks, but need to check by
hand whether a CRLF has been cut in the middle.
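
A minimal sketch of the chunked approach described above, not a definitive implementation. The sample data, the 4-byte chunk size (chosen so that a CRLF pair straddles two reads), and the name $pending_cr are all made up for illustration; real code would read from a file opened in binary mode with a much larger buffer.

```perl
use strict;
use warnings;

# Made-up sample mixing CRLF, CR, and LF line endings.
my $data = "abc\015\012def\015ghi\012";

# An in-memory filehandle stands in for a large file.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
binmode $fh;    # make sure no layer translates CRLF behind our back

my $out        = '';
my $pending_cr = 0;    # true if the previous chunk ended in a bare CR

while (read($fh, my $chunk, 4)) {
    # Put back the CR that was held over from the previous chunk.
    $chunk = "\015" . $chunk if $pending_cr;

    # A trailing CR may be the first half of a CRLF, so hold it back
    # until the next chunk shows what follows it.
    $pending_cr = ($chunk =~ s/\015\z//) ? 1 : 0;

    # CRLF is listed first so that it becomes one newline, not two.
    $chunk =~ s/\015\012|\012|\015/\n/g;
    $out .= $chunk;
}

# A CR left over at end of input was a line ending on its own.
$out .= "\n" if $pending_cr;
close $fh;

print $out;
```

The only state carried between reads is whether the last byte seen was a CR, which is the by-hand check for a CRLF cut in the middle.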


I'm reading each line in a while loop, so it should work fine on a large file?

The while loops over lines ***as long as they are encoded using the conventions of the runtime platform***. The diamond operator uses $/ as separator, which in turn is "\n" by default. Since the purpose of your script is to deal with *any* newline convention, in general a while loop like

  while (my $line = <$fh>) { ... }

looks suspicious. The variable should be called $chunk_of_text instead of $line, because you don't know whether you'll get a "line". Such a loop may signal that the programmer does not fully understand what is going on.

For instance, TextWrangler is known to use old-Mac conventions (CR only) by default (last time I checked). If you read a file like that with that while loop on either Unix or Windows, you'll slurp the entire file in a single iteration. That is, $line will contain the whole file.
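
A small sketch that illustrates the point, assuming a platform where "\n" is LF (Unix or Windows). The helper count_reads and the sample data are hypothetical, and an in-memory filehandle stands in for a real old-Mac file.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many reads the diamond operator
# needs to consume $data when $/ is set to $separator.
sub count_reads {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my $count = 0;
    while (my $chunk_of_text = <$fh>) {
        $count++;
    }
    close $fh;
    return $count;
}

# Three "lines" terminated by CR only, as in old-Mac text files.
my $old_mac = "one\015two\015three\015";

print "default separator: ", count_reads($old_mac, "\n"),   " read(s)\n";
print "CR as separator:   ", count_reads($old_mac, "\015"), " read(s)\n";
```

With the default separator the loop body runs once and receives all the text. With $/ set to "\015" it runs three times, but that only helps if you already know which convention the file uses.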

In general, to be robust to newline conventions you need to do some munging by hand before using regular, portable line-oriented idioms.
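
As a rough sketch of that munging, assuming the input is small enough to hold in memory: normalize every convention to "\n" first, then split into lines. The helper name normalize_newlines and the sample text are made up for illustration. When reading from a real file you would likely want binmode on the handle, so that no I/O layer translates CRLF before the substitution sees it.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF, LF, and CR to the platform's "\n".
sub normalize_newlines {
    my ($text) = @_;
    # Single pass; CRLF comes first in the alternation so the pair
    # becomes one newline instead of two.
    $text =~ s/\015\012|\012|\015/\n/g;
    return $text;
}

# Made-up sample mixing the three conventions.
my $raw_text = "one\015\012two\015three\012four";

# After normalization the usual line-oriented idioms apply.
my @lines = split /\n/, normalize_newlines($raw_text);
print scalar(@lines), " lines: @lines\n";
```

The single alternation is a design choice: it avoids the ordering concern of three separate substitutions, while the replacement string stays the portable "\n".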

-- fxn

