On Sun, 26 Sep 1999, Bob Meyer wrote:
> Yea that looks like a better way. I only wanted to remove the part of
> the line put there by sendmail, but removing any white space might be a good
> idea for the BBS system. My big mistake the first time around was not to
> limit the number of times it split. I should have done something like
> this...
> ($junk,$subject) = split(/: /,$subject,2);
This won't work since you might have ' ' or '\t' after the ':'. Some
MTAs and MUAs prefer to use '\t' instead of ' ' (although this is
discouraged by RFC822).
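
A minimal sketch of a split that tolerates either kind of whitespace after
the colon. The helper name split_header is made up for illustration and is
not from the original script; it assumes the line has already been unfolded.

```perl
use strict;
use warnings;

# Hypothetical helper: split one unfolded header line into name and value,
# allowing any mix of spaces and tabs after the colon.
sub split_header {
    my ($line) = @_;
    # The limit of 2 keeps colons inside the value (e.g. "Re: foo") intact.
    my ($name, $value) = split /:[ \t]*/, $line, 2;
    return ($name, $value);
}

my ($name, $subject) = split_header("Subject:\tRe: test message");
print "$name => $subject\n";
```

The limit argument matters for the same reason as in the quoted split: a
subject such as "Re: something" contains its own ": ".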
> > Also, you still might fail on a multi-line subject entry if your parser
> > does not handle them earlier.
>
> Yup it'll ignore anything past the first newline. A subject is only supposed
> to be one line isn't it?
Nope, see RFC822 "STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT
MESSAGES", part 3.1.1 "LONG HEADER FIELDS". Headers are logically
single-line (a newline is not significant in the content), but they may be
folded onto multiple lines if the line length exceeds some sane value
(usually 70-80 characters). So a looong subject line might be represented
in the headers on multiple lines (as the Received headers often are, for
example). You're supposed to unfold them: drop each newline that is
followed by whitespace, so the continuation joins the previous line. It is
a very good idea to read through RFC822 if you're writing a parser for it;
it's not that long, after all.
I once wrote something like this:
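    my @hdrs;
    while (my $line = <>) {
        chomp $line;
        # A blank line ends the header block.
        last if $line eq "";
        if ($line =~ /^\s.+$/ && @hdrs) {
            # Continuation line: append it to the previous header.
            $hdrs[-1] .= $line;
        } else {
            push @hdrs, $line;
        }
    }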
And then went through @hdrs, cleaning them up and parsing them.
(Yes, RFC822 says CRLF is used for end-of-line, but unix software
generally feeds you with \n alone.)
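
A rough sketch of what that cleanup pass could look like, not the code I
actually used. The helper name parse_headers and the [name, value] layout
are assumptions for illustration; it also strips any stray \r in case the
input did arrive with CRLF line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of already-collected header lines into
# [name, value] pairs. Names and layout here are assumptions for illustration.
sub parse_headers {
    my @parsed;
    for my $hdr (@_) {
        my $line = $hdr;
        $line =~ s/\r//g;      # drop any CR left over from CRLF line endings
        $line =~ s/\s+/ /g;    # collapse folding whitespace to single spaces
        $line =~ s/ $//;       # trim a trailing space
        # Field name is everything up to the first colon; skip malformed lines.
        next unless $line =~ /^([^:\s]+)\s*:\s*(.*)$/;
        push @parsed, [$1, $2];
    }
    return @parsed;
}

my @hdrs = (
    "Subject: a rather long subject\r\tthat was folded",
    "To:\tsomeone\@example.org",
);
printf "%s => %s\n", @$_ for parse_headers(@hdrs);
```

Collapsing all runs of whitespace is convenient for something like a
Subject line, but it is lossy, so a parser that needs the exact field body
should skip that step.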
- Hessu