-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday, May 21 at 02:24 PM, quoth Wilkinson, Alex:
>0n Tue, May 20, 2008 at 07:14:21PM -0500, Kyle Wheeler wrote:
>
>> The original reason for this script was because urlview doesn't
>> correctly handle format=flowed email or any other email encodings,
>> so URLs are often mishandled or simply broken. This script handles
>> all known encodings *correctly* (when fed the raw email). It can be
>> used either as a standalone script (which requires the Curses::UI
>> perl module) or as a pre-filter for urlview.
>
> Ahh, now this is what i like to hear.

:)

> I have a few questions:
>
> 1. What is meant by "format=flowed email" ?

Email that is tagged as "format=flowed" (i.e. it says so in the
Content-Type header) informs the receiving client that some lines are
"connected" and some lines are not. Lines that end in a space are
considered "to be continued". It's kinda like putting a backslash at
the end of a shell-script line. This allows the client to re-format
and re-wrap all the lines in an email to fit whatever display width is
available. Note that this is for text/plain email only; NOT HTML.

Many mail clients can send format=flowed (also known as "f=f") mail,
including mutt, Eudora, and Apple's Mail.app, among others. Part of
the motivation for f=f mail is that the email spec limits line length,
and part of it is that devices like BlackBerries need to be able to
redisplay email on much smaller screens than the email was written
for.

There's a variant of f=f email, called delsp=yes email (i.e. it has
that tag in the Content-Type header as well). The difference between
this variant and the "standard" f=f email is in how lines are joined.
Specifically, do we leave the space in there, or not? With delsp=yes,
the trailing space is deleted when the lines are joined; without it,
the space stays. Several email clients use this technique to split
long URLs over multiple lines (thus obeying the line-length
restrictions of the relevant email RFCs) in a way that allows the
client to reconstruct the original URL easily. In practice, this means
that sentences get broken up by two spaces at the end of each line,
while URLs get broken up by a single space at the end of the line.

My extract_url.pl handles both kinds of format=flowed email correctly.

(As an example, this email I'm sending right now is format=flowed
formatted. Note that most lines end in a space.)

> 2. What are the "known encodings" ?

Primarily, Base64 and quoted-printable. In essence, anything that's
understood by Perl's MIME::Parser.

> I often have broken links in the body of my emails and I don't know why e.g.
>
> The link is meant to look like:
>
>
> http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X
>
> But I will always see it like this in mutt:
>
> http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/2423300
> 10/60/54/X

That kind of thing happened to me a *lot*.

> When I look at the raw spool file (independent of mutt) I see:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> <HTML>
> <HEAD>
> <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
> charset=3Dus-ascii">
> <META NAME=3D"Generator" CONTENT=3D"MS Exchange Server version =
> 6.5.7652.24">
> <TITLE>Link to catalogue</TITLE>
> </HEAD>
> <BODY>
> <!-- Converted from text/rtf format -->
> <BR>
>
> <P><A =
> HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=
> 242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =
> FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=
> /DSTOE/242330010/60/54/X</FONT></U></A>
> </P>
>
> Would your script deal with this annoying problem (which I still don't
> understand). If it would ... I am going to use it permanently :)

Yes, my script would handle that. What you have there is an HTML email
that's been quoted-printable encoded. The MIME::Parser module
automatically transforms the quoted-printable form:

    <P><A =
    HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=
    242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =
    FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=
    /DSTOE/242330010/60/54/X</FONT></U></A>
    </P>

Into straight-up HTML:

    <P><A HREF="http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X</FONT></U></A>
    </P>

And then my script runs that through an HTML parser to extract the URL
from the <A HREF=""> tag.
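
For anyone who wants to see those two steps in isolation, here is a
minimal sketch in Python. To be clear, this is not the Perl code in
extract_url.pl; it only illustrates the same idea (decode the
quoted-printable first, then pull the href out with an HTML parser)
using Python's standard quopri and html.parser modules. The names
RAW_QP, HrefCollector and extract_hrefs are made up for this example,
and the us-ascii charset is taken from the META tag in the quoted
message.

```python
import quopri
from html.parser import HTMLParser

# Quoted-printable fragment copied from the spool file quoted above.
RAW_QP = (
    b'<P><A =\n'
    b'HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=\n'
    b'242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =\n'
    b'FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=\n'
    b'/DSTOE/242330010/60/54/X</FONT></U></A>\n'
    b'</P>\n'
)


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(raw_qp, charset="us-ascii"):
    """Decode quoted-printable bytes, then return all <a href> values."""
    # Step 1: undo the encoding ("=3D" becomes "=", and a trailing "="
    # is a soft line break, so the split lines are joined back together).
    html = quopri.decodestring(raw_qp).decode(charset)
    # Step 2: let an HTML parser find the link targets.
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    for url in extract_hrefs(RAW_QP):
        print(url)
```

Run on the fragment above, it prints the URL in one piece, because the
soft line breaks are removed before anything goes looking for links.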

Obviously, to work its magic best, my script needs access to the raw
form of the email (if you feed it the pre-formatted output of a web
browser like lynx, there's no way to tell whether the URL necessarily
continues on the next line or not). But, given the raw form (i.e.
following the directions on the web page), it will handle that.

The biggest difference between extract_url.pl and urlview is that
urlview is just looking for URLs in plain text (which means when the
line ends, so does the URL), while extract_url.pl decodes things
first, and so can reconstruct URLs that have been split over multiple
lines. Because of that difference, extract_url.pl can be used as a
pre-filter for urlview (it just prints out all the URLs in a form that
urlview can understand).

This new version of extract_url.pl has the ability to do something
else that urlview cannot, and that's maintain some sense of the
context of a given URL from the original email. It's not perfect (take
duplicate URLs for example), but I think it's worthwhile.

~Kyle
- -- 
Reason is itself a matter of faith. It is an act of faith to assert
that our thoughts have any relation to reality at all.
                                                 -- G. K. Chesterton
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iD8DBQFIM8qIBkIOoMqOI14RAivAAKC4ZS9kfunofrnRsEdb9ChDjpy6UgCghIag
UGXsKhwXgpXPgZN7IRqfSEs=
=H+py
-----END PGP SIGNATURE-----
