Re: [PATCH 0/5] Second take at a more lax From_ parser

Derek Martin Thu, 29 Jan 2026 09:07:50 -0800

On Thu, Jan 29, 2026 at 01:48:07AM +0100, Steffen Nurpmeso wrote:
> Why the Content-Length Format is Bad (a humble opinion)
> 
>   https://www.jwz.org/doc/content-length.html
> 
> More from 1996 thus.  And super current on-topic.

Yes, 1996, but the only thing that has changed since then is that
there are a lot fewer upstart e-mail clients doing it their own way,
since e-mail is a fairly well-solved problem.

Jamie Zawinski (the jwz in the hostname), for those who don't know,
is a well-known and mostly well-regarded pioneer of a variety of
internet and software technologies, and one of the lead developers of
the Netscape web browser, and more notably in relation to this thread,
Netscape Mail.  He was also somewhat of a security expert.  Last I
heard he'd abandoned software to run a techno dance club.
[Seriously.] =8^)

What I think is especially notable in that thread is that it
completely backs up what I've been saying about the from line:

    But, here's the good news, there is no true specification of this
    file format, just a collection of word-of-mouth behaviors of the
    various programs over the last few decades which have used that
    format.

    Essentially the only safe way to parse that file format is to
    consider all lines which begin with the characters ``From ''
    (From-space), which are preceded by a blank line or
    beginning-of-file, to be the division between messages. That is,
    the delimiter is "\n\nFrom .*\n" except for the very first message
    in the file, where it is "^From .*\n".

    Some people will tell you that you should do stricter parsing on
    those lines: check for user names and dates and so on. They are
    wrong. The random crap that has traditionally been dumped into
    that line is without bound; comparing the first five characters is
    the only safe and portable thing to do. Usually, but not always,
    the next token on the line after ``From '' will be a user-id, or
    email address, or UUCP path, and usually the next thing on the
    line will be a date specification, in some format, and usually
    there's nothing after that. But you can't rely on any of this. 

Exactly.
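
For illustration only, the rule quoted above can be sketched in a few
lines of Python.  This is not code from the linked page or from any
existing mail client; the function name split_mbox is hypothetical, and
the sketch assumes the whole mailbox fits in memory as bytes.

```python
def split_mbox(data: bytes) -> list[bytes]:
    """Split raw mbox bytes into messages using only the "From " prefix rule.

    A line starts a new message if it begins with b"From " and is
    preceded by a blank line or by beginning-of-file.  Nothing else on
    the line (user name, date, ...) is inspected.
    """
    messages = []
    current = []
    prev_blank = True  # beginning-of-file counts like a preceding blank line
    for line in data.splitlines(keepends=True):
        if prev_blank and line.startswith(b"From "):
            if current:
                messages.append(b"".join(current))
            current = [line]
        else:
            # Any junk before the first From_ line ends up in the first chunk.
            current.append(line)
        prev_blank = line.strip(b"\r\n") == b""
    if current:
        messages.append(b"".join(current))
    return messages
```

A body line that starts with "From " but does not follow a blank line
is left alone here; one that does follow a blank line would still split
the message, which is why From escaping in the body matters.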

I do also more-or-less agree with his thoughts about Content-Length,
but I wouldn't go so far as to call it brain-damaged.  It was another
attempt to solve the "Where does the next message start?" problem, and
it has some real drawbacks (which he mostly accurately describes), but
so does every other method.

> I have expressed Dr. Fink's wishes in the past.

And I have already rebutted those ideas back in August when you posted
them.  It's nonsensical bunk.

On Thu, Jan 29, 2026 at 03:46:43AM +0100, Vincent Lefevre wrote:
> > Why the Content-Length Format is Bad (a humble opinion)
> > 
> >   https://www.jwz.org/doc/content-length.html
[...]
> Yes, 1996. But...
> 
> "This latter format is non-portable, easily-corruptible, and overall,
> brain-damaged (that's a technical term.)"
> 
> Non-portable: Fortunately, it matters only locally, and the user
> can ensure that their tools handle it correctly (in particular,
> recompute the value when the message is received and whenever it
> needs to be modified).

Yes, I agree, but if some externally imposed necessity forces you to
use a tool that does not do The Right Thing™, this may not save you.

> Easily-corruptible: If the tools handle it correctly, it should not
> be corrupt (it is probably less risky than the other methods).

Unfortunately, this is false.  If the tool were in the middle of
making an update to the content-length, or the actual content, and you
had a power failure, hardware failure, etc., then your whole mailbox
from that message on is completely useless, because the algorithm will
never find the next message where it is supposed to.  Doesn't matter
how well the tool implements the algorithm (though being able to fall
back to From_ line parsing would help a lot, provided you actually
also did From escaping in the body... but at that point, you may as
well just use From_ line parsing).
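
A rough Python sketch of that failure mode and of the fallback, under
stated assumptions: LF-only line endings, the mailbox held in memory as
bytes, and body lines beginning with "From " already escaped.  The name
next_message_offset is made up for this example and does not come from
any real tool.

```python
import re

_CL = re.compile(rb"^Content-Length:[ \t]*(\d+)[ \t]*$",
                 re.MULTILINE | re.IGNORECASE)

def next_message_offset(data: bytes, start: int) -> tuple[int, bool]:
    """Find where the message after the one starting at `start` begins.

    Returns (offset, trusted): offset is the position of the next From_
    line (or len(data) if there is none), and trusted says whether the
    Content-Length header was believed.
    """
    header_end = data.find(b"\n\n", start)
    if header_end == -1:
        return len(data), False
    body_start = header_end + 2
    match = _CL.search(data[start:header_end + 1])
    if match:
        end = body_start + int(match.group(1))
        rest = data[end:].lstrip(b"\n")
        # Believe the header only if it lands us on a From_ line or on EOF.
        if end <= len(data) and (rest == b"" or rest.startswith(b"From ")):
            return len(data) - len(rest), True
    # Fallback: plain From_ scanning.  Safe only with From escaping in bodies.
    nxt = data.find(b"\n\nFrom ", header_end + 1)
    return (nxt + 2 if nxt != -1 else len(data)), False
```

With a stale Content-Length (say, a crash between updating the body and
updating the header), the first branch lands in the middle of some later
message and only the From_ scan recovers.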

Of course, in general mbox has this problem of being corruptible,
though the extent to which that is true depends a great deal on what
tradeoffs you make implementing your folder I/O strategy.  One
approach, for maximum safety (but terrible performance for many
operations) is to rewrite the whole mailbox to a temp file, then when
that completes, rename it to the correct file name.  Your mailbox will
not be corrupted, though its state may be irreparably outdated
(without redoing whatever operation was in progress at the time of the
incident, that is).
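
In Python, the temp-file-then-rename approach might look roughly like
the sketch below.  It is only an illustration: rewrite_mailbox is a
hypothetical name, mailbox locking is left out entirely, and it assumes
the temp file can be created in the same directory as the mailbox so
that the rename stays on one filesystem.

```python
import os
import tempfile

def rewrite_mailbox(path: str, new_contents: bytes) -> None:
    """Replace the mailbox at `path` by writing a temp file and renaming it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".mbox-tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # Until this call succeeds, the old mailbox is still intact.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The cost is visible in the signature: every change, however small,
writes the entire mailbox again.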

And IIRC, UW IMAPd used to use this approach, which is why you'd start
having performance problems if you let users keep very large
mailboxes.  Or at least, it's one of the reasons... ;-)

> Brain-damaged: ???
> 
> So what was needed was just a transition from old problematic methods
> to the general use of Content-Length *locally*. But MUAs should still
> support (old) mbox files where Content-Length is not used.

Yep.  But, don't forget that at the time, most tools only supported
one or the other (or something different still), with the From_ line
method being by far the most prevalent, so at that moment in time, 
calling it brain-damaged wasn't so crazy.

-- 
Derek D. Martin    http://www.pizzashack.org/   GPG Key ID: 0xDFBEAD02
-=-=-=-=-
This message is posted from an invalid address.  Replying to it will result in
undeliverable mail due to spam prevention.  Sorry for the inconvenience.
