On Sun, 26 Sep 1999, Bob Meyer wrote:

> Yea that looks like a better way.  I only wanted to remove the part of
> the line put there by sendmail, but removing any white space might be a good
> idea for the BBS system.  My big mistake the first time around was not to
> limit the number of times it split.  I should have done something like
> this...
> ($junk,$subject) = split(/: /,$subject,2);

  This won't work reliably, since the ':' may be followed by ' ' or '\t'
(or by no whitespace at all). Some MTAs and MUAs prefer to use '\t'
instead of ' ' (although this is discouraged by RFC822).
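
  A minimal sketch of a more tolerant way to strip the field name, assuming
the header has already been read into a single string. The helper name
header_value is made up for this example; it accepts any run of spaces or
tabs (or nothing) after the ':' and leaves later colons in the value alone.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value part of one header line,
# accepting any mix of spaces and tabs (or none) after the colon.
sub header_value {
    my ($line) = @_;
    # Field name runs up to the first ':'; optional whitespace follows it.
    return $line =~ /^[^:\s]+:[ \t]*(.*)$/s ? $1 : undef;
}

print header_value("Subject:\tRe: test: one"), "\n";   # prints "Re: test: one"
```

  It returns undef for a line that does not look like a header, so the
caller can decide what to do with junk.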

> >   Also, you still might fail on a multi-line subject entry if your parser
> > does not handle them earlier.
> 
> Yup it'll ignore anything past the first newline.  A subject is only supposed
> to be one line isn't it?

  Nope, see RFC822 "STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT
MESSAGES", part 3.1.1 "LONG HEADER FIELDS". Headers are logically
single-line (a newline is not significant in the content), but they may be
folded onto multiple lines if the line length exceeds some sane value
(usually 70-80 characters). So a looong subject line might be represented
in the headers on multiple lines (as the Received headers often are, for
example), and you're supposed to unfold them: drop each line break that is
followed by whitespace, so in practice replace \n\s+ with a single space.
It is a very good idea to read through RFC822 if you're writing a parser
for it; it's not that long after all.
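
  Unfolding can be sketched roughly like this, assuming the whole header
block is already in one string. The helper name unfold_headers is invented
for the example. Collapsing the continuation whitespace to a single space
is a simplification; RFC822 itself only treats the line break as absent
and keeps the whitespace that follows it.

```perl
use strict;
use warnings;

# Hypothetical helper: unfold a raw header block held in one string.
# A line break followed by a space or tab marks a continuation line.
sub unfold_headers {
    my ($raw) = @_;
    $raw =~ s/\r?\n[ \t]+/ /g;   # join each continuation onto the previous line
    return $raw;
}

my $raw = "Subject: a very long\n\tsubject line\nFrom: someone\n";
print unfold_headers($raw);
# Subject: a very long subject line
# From: someone
```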

  I once wrote something like this:

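my @hdrs;

# Read the header block; the first empty line ends it.
while (my $line = <>) {
        chomp $line;

        last if $line eq "";

        if ($line =~ /^\s/ && @hdrs) {
                # Continuation line: append it to the previous header.
                $hdrs[-1] .= $line;
        } else {
                push @hdrs, $line;
        }
}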

  And then went through @hdrs, cleaning them up and parsing them.
(Yes, RFC822 says CRLF is used for end-of-line, but Unix software
generally feeds you \n alone.)
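
  The clean-up pass might look something like the sketch below. It is not
the code I actually used; parse_headers is a made-up name, and lowercasing
the field names and keeping a list per name are choices made for this
example (fields such as Received can appear more than once).

```perl
use strict;
use warnings;

# Hypothetical helper: turn unfolded "Name: value" lines into a hash.
# Keys are lowercased field names; each value is an array reference.
sub parse_headers {
    my (@lines) = @_;
    my %h;
    for my $line (@lines) {
        # Skip anything that does not look like "name: value".
        next unless $line =~ /^([^:\s]+)[ \t]*:\s*(.*?)\s*$/s;
        push @{ $h{lc $1} }, $2;
    }
    return \%h;
}

my $h = parse_headers("Subject:\tRe: hello  ", "Received: a", "Received: b");
print "$h->{subject}[0]\n";                  # prints "Re: hello"
print scalar @{ $h->{received} }, "\n";      # prints 2
```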

  - Hessu
