Re: [dev] Stripping html from email

2010-08-26 Thread Kris Maglione

On Thu, Aug 26, 2010 at 11:24:11AM +0100, Kai Hendry wrote:

> I noticed no one mentioned http://packages.qa.debian.org/m/mpack.html `munpack`


Indeed, I've been using mpack and ripmime for years, but I think 
that altermime would be cleaner in this case.


--
Kris Maglione

Religion began when the first scoundrel met the first fool.
--Voltaire




Re: [dev] Stripping html from email

2010-08-26 Thread Josh Rickmar
On Tue, Aug 24, 2010 at 04:58:20PM +0200, pancake wrote:
> there's dmc-pack to pack and unpack mime attachments. The
> implementation is 162 LOC and works quite nicely. I think it is the
> sanest way to work with it.

dmc looks like it could be just what I need, but unfortunately I can't
compile it on OpenBSD.  Any help appreciated.  If this doesn't work
out, I could always fall back to Kurt's perl solution.

$ gmake HAVE_SSL=0 
cc -Wall -DHAVE_SSL=0 -DVERSION=\"0.1\" -DPREFIX=\"/usr\"   -c -o dmc.o dmc.c
dmc.c: In function 'dmcinit':
dmc.c:195: warning: implicit declaration of function 'signal'
dmc.c:195: error: 'SIGINT' undeclared (first use in this function)
dmc.c:195: error: (Each undeclared identifier is reported only once
dmc.c:195: error: for each function it appears in.)
dmc.c: In function 'dmcstart':
dmc.c:230: error: 'SIGPIPE' undeclared (first use in this function)
dmc.c:230: error: 'SIG_IGN' undeclared (first use in this function)
dmc.c:256: warning: missing sentinel in function call
dmc.c: In function 'dmckill':
dmc.c:271: warning: implicit declaration of function 'kill'
dmc.c:271: error: 'SIGKILL' undeclared (first use in this function)
dmc.c: In function 'dmcstop':
dmc.c:278: error: 'SIGALRM' undeclared (first use in this function)
gmake: *** [dmc.o] Error 1



Re: [dev] Stripping html from email

2010-08-26 Thread Szabolcs Nagy
* Antoni Grzymala  [2010-08-26 12:39:33 +0200]:
> [1] uri://some.url...
> 
> notation, so that I can actually fish out the links. Is that possible
> in w3m as well?
> 

in interactive mode with 'L' you can list links and images
but i don't think there is a command line switch for that
in general w3m does not have too many command line options

to fish out urls i guess unix tools can help with that
 |tr '<' '\n' |grep -i href=
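For HTML where tags and attributes are split across lines, a small
parser is less fragile than tr and grep. The following is a minimal
sketch using only Python's standard html.parser module; the names
HrefCollector and extract_hrefs are made up for this example and do not
come from w3m or any tool mentioned in this thread.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute, whatever the tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so this matches
        # HREF= as well, like grep -i would.
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found in an HTML string, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Like the grep version, this reports href on any element (a, link,
area, ...), not only on anchors.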




Re: [dev] Stripping html from email

2010-08-26 Thread Antoni Grzymala
Suraj Kurapati dixit (2010-08-23, 21:05):

> On Mon, Aug 23, 2010 at 8:46 PM, Anthony J. Bentley
>  wrote:
> >> Is there currently a tool or script that I can use to strip html
> >> from emails?
> >
> > mhshow-show-text/html: lynx -dump %F | less
> >
> > Lynx sucks but it sorta works well enough here, I guess.
> 
> I find that w3m does a much better job of HTML to plain-text
> conversion than Lynx.  It even renders HTML tables using Unicode
> box-drawing characters!
> 
> http://w3m.sourceforge.net/

I tried using w3m instead of lynx -dump, and it's truly better at
rendering, but lynx used the traditional blah[1]...

[1] uri://some.url...

notation, so that I can actually fish out the links. Is that possible
in w3m as well?
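If the renderer cannot produce that notation itself, it can be
approximated in a separate pass. Below is a rough sketch, assuming
Python's standard html.parser module; the class FootnoteText and the
function render are hypothetical names, and the output only imitates
the blah[1] style described above rather than reproducing the exact
layout of lynx -dump.

```python
from html.parser import HTMLParser


class FootnoteText(HTMLParser):
    """Emit text content, with a [n] marker after each linked phrase."""

    def __init__(self):
        super().__init__()
        self.out = []       # pieces of the rendered text
        self.links = []     # URLs, in order of appearance
        self._pending = []  # href of each <a> that is still open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._pending.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self._pending:
            href = self._pending.pop()
            if href:
                self.links.append(href)
                self.out.append("[%d]" % len(self.links))

    def handle_data(self, data):
        self.out.append(data)


def render(html):
    """Return the text of html followed by a numbered list of its links."""
    parser = FootnoteText()
    parser.feed(html)
    parser.close()
    body = "".join(parser.out)
    refs = "\n".join("[%d] %s" % (i, url)
                     for i, url in enumerate(parser.links, 1))
    return body + "\n\n" + refs if refs else body
```

This does no layout at all (no tables, no wrapping, and script or style
content is passed through as text), so it is a way to recover the
links rather than a replacement for w3m's rendering.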

-- 
[a]



Re: [dev] Stripping html from email

2010-08-26 Thread Kai Hendry
I noticed no one mentioned http://packages.qa.debian.org/m/mpack.html `munpack`

I noticed this as I began working on a maildir -> Web archive thing last Sunday
http://m.dabase.com/
Very early days still.


I will definitely consider dmc-unpack instead, of course.



Re: [dev] Stripping html from email

2010-08-26 Thread Nick
Quoth pancake:
> there's dmc-pack to pack and unpack mime attachments. The
> implementation is 162 LOC and works quite nicely. I think it is the sanest
> way to work with it.

Just took a look at dmc. It looks really nice. I enjoyed reading the 
code.

Just a quick question: how are you planning to support filtering (as
in, saving in different directories according to rules)?  Just
dispatch off to an MDA?  Or something else?

Nick




Re: [dev] Stripping html from email

2010-08-25 Thread Robert Ransom
On Wed, 25 Aug 2010 22:31:58 -0400
Josh Rickmar  wrote:

> Where can I get
> the dmc source again?

See .


Robert Ransom




Re: [dev] Stripping html from email

2010-08-25 Thread Josh Rickmar
On Tue, Aug 24, 2010 at 04:58:20PM +0200, pancake wrote:
>  On 08/24/10 16:45, Kurt H Maier wrote:
> >MIME sucks; there's no nice way to deal with it.  I use perl and the
> there's dmc-pack to pack and unpack mime attachments. The
> implementation is 162 LOC and works quite nicely. I think it is the
> sanest way to work with it.
> 

Thanks, I'll take a look at it (not a perl fan..).  Where can I get
the dmc source again?



Re: [dev] Stripping html from email

2010-08-24 Thread anonymous
On Tue, Aug 24, 2010 at 04:26:46PM -0700, Robert Ransom wrote:
> On Tue, 24 Aug 2010 20:01:10 +0400
> The ‘tdb’ library is actually LGPLed.

OK, tdb.h says it is under the LGPL.  But both the SourceForge page and
the Arch Linux package say it is under GPLv3; the package metadata was
probably just copied from the SourceForge page.  There is also a ctdb
tree with a GPLed database library, which may be a newer version: tdb.h
says tdb is 1999-2004.

Now everything is OK: I have a simple, BSD-licensed MDA.  It looks like
it hasn't been updated for half a year, but it is very useful anyway.




Re: [dev] Stripping html from email

2010-08-24 Thread Robert Ransom
On Tue, 24 Aug 2010 20:01:10 +0400
anonymous  wrote:

> It looks like fdm is BSD licensed but uses tdb, which is GPLv3
> licensed.  Is that ok?

The ‘tdb’ library is actually LGPLed.


Robert Ransom




Re: [dev] Stripping html from email

2010-08-24 Thread Uriel
On Tue, Aug 24, 2010 at 4:45 PM, Kurt H Maier  wrote:
> MIME sucks; there's no nice way to deal with it.

Indeed.

http://harmful.cat-v.org/software/mime

uriel



Re: [dev] Stripping html from email

2010-08-24 Thread anonymous
On Mon, Aug 23, 2010 at 11:55:35PM -0400, Josh Rickmar wrote:
> Yeah, not quite what I'm looking for.  Basically I want something
> that I can pipe the message to with my MDA (fdm) before it is
> delivered to my maildir.

Thanks, I didn't know about fdm and used getmail+procmail.  Now I have
switched to fdm and my config is only 6 lines, not counting comments
and blank lines.

It looks like fdm is BSD licensed but uses tdb, which is GPLv3
licensed.  Is that ok?




Re: [dev] Stripping html from email

2010-08-24 Thread pancake

 On 08/24/10 16:45, Kurt H Maier wrote:

> MIME sucks; there's no nice way to deal with it.  I use perl and the

there's dmc-pack to pack and unpack mime attachments. The
implementation is 162 LOC and works quite nicely. I think it is the sanest
way to work with it.




Re: [dev] Stripping html from email

2010-08-24 Thread Kurt H Maier
On Tue, Aug 24, 2010 at 9:27 AM, Josh Rickmar  wrote:
> anonymous is right, I just want to remove the text/html attachments,
> not strip the html tags.

MIME sucks; there's no nice way to deal with it.  I use perl and the
Mail::Message package from cpan.

--
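#!/usr/bin/perl
# Filter: read one message on stdin, drop text/html parts, print to stdout.

use strict;
use warnings;
use Mail::Message;

my $message = Mail::Message->read(\*STDIN);

# Only multipart messages are touched; html-only mail passes through.
if ($message->isMultipart) {
    foreach my $part ($message->parts) {
        # Compare the part's MIME type case-insensitively.
        $part->delete if lc($part->contentType) eq 'text/html';
    }
}

$message->print(\*STDOUT);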
--

That will delete html attachments, but only from multipart messages
(so html-only mail will be left alone).  Just pipe the message to it
on stdin and it writes the message (properly restructured if
necessary) to stdout.


-- 
# Kurt H Maier



Re: [dev] Stripping html from email

2010-08-24 Thread Josh Rickmar
On Tue, Aug 24, 2010 at 09:07:25AM -0400, Kurt H Maier wrote:
> On Tue, Aug 24, 2010 at 9:01 AM, anonymous  wrote:
> > But that is not what the OP asked for.  The tool should process MIME
> > emails and remove text/html attachments.
> 
> that is a different task than stripping html from email data.  OP
> should be looking for two tools.

anonymous is right, I just want to remove the text/html attachments,
not strip the html tags.



Re: [dev] Stripping html from email

2010-08-24 Thread Kurt H Maier
On Tue, Aug 24, 2010 at 9:01 AM, anonymous  wrote:
> But that is not what the OP asked for.  The tool should process MIME
> emails and remove text/html attachments.

that is a different task than stripping html from email data.  OP
should be looking for two tools.

-- 
# Kurt H Maier



Re: [dev] Stripping html from email

2010-08-24 Thread anonymous
On Tue, Aug 24, 2010 at 08:57:12AM -0400, Kurt H Maier wrote:
> On Tue, Aug 24, 2010 at 8:38 AM, Nick  wrote:
> > On Tue, Aug 24, 2010 at 07:31:18AM -0400, Kurt H Maier wrote:
> >> http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm
> >
> > Umm. Is no-one reading the body of the original request? We can all
> > strip XML easily, that isn't the question.
> 
> 
> On Mon, Aug 23, 2010 at 11:32 PM, Josh Rickmar  
> wrote:
> > Is there currently a tool or script that I can use to strip html
> > from emails?
> 
> DESCRIPTION ^
> 
> This module simply strips HTML-like markup from text in a very quick
> and brutal manner.

But that is not what the OP asked for.  The tool should process MIME
emails and remove text/html attachments.




Re: [dev] Stripping html from email

2010-08-24 Thread Kurt H Maier
On Tue, Aug 24, 2010 at 8:38 AM, Nick  wrote:
> On Tue, Aug 24, 2010 at 07:31:18AM -0400, Kurt H Maier wrote:
>> http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm
>
> Umm. Is no-one reading the body of the original request? We can all
> strip XML easily, that isn't the question.


On Mon, Aug 23, 2010 at 11:32 PM, Josh Rickmar  wrote:
> Is there currently a tool or script that I can use to strip html
> from emails?

DESCRIPTION ^

This module simply strips HTML-like markup from text in a very quick
and brutal manner.

-- 
# Kurt H Maier



Re: [dev] Stripping html from email

2010-08-24 Thread pancake

 On 08/24/10 14:38, Nick wrote:

> On Tue, Aug 24, 2010 at 07:31:18AM -0400, Kurt H Maier wrote:
> > http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm
>
> Umm. Is no-one reading the body of the original request? We can all
> strip XML easily, that isn't the question.


pacman -S html2text



Re: [dev] Stripping html from email

2010-08-24 Thread pancake

 On 08/24/10 05:46, Anthony J. Bentley wrote:

> > Is there currently a tool or script that I can use to strip html
> > from emails?  Basically, it should work like this:
> >
> > - Read the message from stdin
> > - If there is no html, leave as is
> > - If it finds both html and plain text, strip the html attachment
> > - If it finds html but no plain text, leave as is
> >
> > In case something like this doesn't exist, I wouldn't mind writing
> > one for myself (awk sounds like the right tool for the job).
>
> It’s not quite what you’re asking for, but I have nmh set up like this:
> mhshow-show-text/html: lynx -dump %F | less
>
> Lynx sucks but it sorta works well enough here, I guess.


the encoding of your mail is wrong



Re: [dev] Stripping html from email

2010-08-24 Thread Nick
On Tue, Aug 24, 2010 at 07:31:18AM -0400, Kurt H Maier wrote:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm

Umm. Is no-one reading the body of the original request? We can all 
strip XML easily, that isn't the question.



Re: [dev] Stripping html from email

2010-08-24 Thread Kurt H Maier
http://search.cpan.org/~kilinrax/HTML-Strip-1.06/Strip.pm

-- 
# Kurt H Maier



Re: [dev] Stripping html from email

2010-08-24 Thread Etienne Millon
On Tue, Aug 24, 2010 at 07:45:17AM +0100, Kai Hendry wrote:
> It would be great if there was a tool to convert HTML to markdown. ;)

Actually, pandoc can do that. :-)

-- 
Etienne Millon



Re: [dev] Stripping html from email

2010-08-24 Thread Anselm R Garbe
On Mon, Aug 23, 2010 at 10:55:14PM -0500, Stanley Lieber wrote:
> On Mon, Aug 23, 2010 at 10:46 PM, Anthony J. Bentley
>  wrote:
> >
> > It’s not quite what you’re asking for, but I have nmh set up like this:
> > mhshow-show-text/html: lynx -dump %F | less
> >
> > Lynx sucks but it sorta works well enough here, I guess.
> 
> also see htmlfmt:
> 
> http://swtch.com/plan9port/man/man1/fmt.html

I have been asked to add this to 9base, and it's on the TODO list for the next release.

-Anselm




Re: [dev] Stripping html from email

2010-08-23 Thread Kai Hendry
It would be great if there was a tool to convert HTML to markdown. ;)



Re: [dev] Stripping html from email

2010-08-23 Thread Benjamin R. Haskell
On Mon, 23 Aug 2010, Suraj Kurapati wrote:

> On Mon, Aug 23, 2010 at 8:46 PM, Anthony J. Bentley wrote:
> >> Is there currently a tool or script that I can use to strip html 
> >> from emails?
> >
> > mhshow-show-text/html: lynx -dump %F | less
> >
> > Lynx sucks but it sorta works well enough here, I guess.
> 
> I find that w3m does a much better job of HTML to plain-text 
> conversion than Lynx.  It even renders HTML tables using Unicode 
> box-drawing characters!
> 
> http://w3m.sourceforge.net/
> 

Wow.  Thanks for that.  I've always preferred 'links' to 'lynx', but 
'w3m' just dethroned it.

For the crappy HTML emails I deal with at work that assume everyone uses 
HTML-based email, I had to add an explicit type:

w3m -dump -T text/html
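From a script, the same command can be fed over a pipe. This is only a
sketch: the function name html_to_text and the W3M_CMD constant are
invented for the example, and it assumes w3m is installed and on PATH.
The only flags used are the two shown above.

```python
import subprocess

# The command from this message; -T forces the input type, which matters
# when the HTML arrives on stdin with no file extension to guess from.
W3M_CMD = ["w3m", "-dump", "-T", "text/html"]


def html_to_text(html, cmd=W3M_CMD):
    """Pipe HTML bytes through an external renderer and return its output.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, input=html, stdout=subprocess.PIPE,
                            check=True)
    return result.stdout
```

The cmd parameter is there so another dumper (for example lynx -dump
with suitable options) can be substituted without touching the caller.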

-- 
Best,
Ben



Re: [dev] Stripping html from email

2010-08-23 Thread Suraj Kurapati
On Mon, Aug 23, 2010 at 8:46 PM, Anthony J. Bentley
 wrote:
>> Is there currently a tool or script that I can use to strip html
>> from emails?
>
> mhshow-show-text/html: lynx -dump %F | less
>
> Lynx sucks but it sorta works well enough here, I guess.

I find that w3m does a much better job of HTML to plain-text
conversion than Lynx.  It even renders HTML tables using Unicode
box-drawing characters!

http://w3m.sourceforge.net/



Re: [dev] Stripping html from email

2010-08-23 Thread Stanley Lieber
On Mon, Aug 23, 2010 at 10:46 PM, Anthony J. Bentley
 wrote:
>
> It’s not quite what you’re asking for, but I have nmh set up like this:
> mhshow-show-text/html: lynx -dump %F | less
>
> Lynx sucks but it sorta works well enough here, I guess.

also see htmlfmt:

http://swtch.com/plan9port/man/man1/fmt.html

-sl



Re: [dev] Stripping html from email

2010-08-23 Thread Josh Rickmar
On Mon, Aug 23, 2010 at 09:46:58PM -0600, Anthony J. Bentley wrote:
> > Is there currently a tool or script that I can use to strip html
> > from emails?  Basically, it should work like this:
> > 
> > - Read the message from stdin
> > - If there is no html, leave as is
> > - If it finds both html and plain text, strip the html attachment
> > - If it finds html but no plain text, leave as is
> > 
> > In case something like this doesn't exist, I wouldn't mind writing
> > one for myself (awk sounds like the right tool for the job).
> 
> It’s not quite what you’re asking for, but I have nmh set up like this:
> mhshow-show-text/html: lynx -dump %F | less
> 
> Lynx sucks but it sorta works well enough here, I guess.
> 

Yeah, not quite what I'm looking for.  Basically I want something
that I can pipe the message to with my MDA (fdm) before it is
delivered to my maildir.



Re: [dev] Stripping html from email

2010-08-23 Thread Anthony J. Bentley
> Is there currently a tool or script that I can use to strip html
> from emails?  Basically, it should work like this:
> 
> - Read the message from stdin
> - If there is no html, leave as is
> - If it finds both html and plain text, strip the html attachment
> - If it finds html but no plain text, leave as is
> 
> In case something like this doesn't exist, I wouldn't mind writing
> one for myself (awk sounds like the right tool for the job).

It’s not quite what you’re asking for, but I have nmh set up like this:
mhshow-show-text/html: lynx -dump %F | less

Lynx sucks but it sorta works well enough here, I guess.



[dev] Stripping html from email

2010-08-23 Thread Josh Rickmar
Is there currently a tool or script that I can use to strip html
from emails?  Basically, it should work like this:

- Read the message from stdin
- If there is no html, leave as is
- If it finds both html and plain text, strip the html attachment
- If it finds html but no plain text, leave as is

In case something like this doesn't exist, I wouldn't mind writing
one for myself (awk sounds like the right tool for the job).
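The four rules above can be sketched with the email package from
Python's standard library instead of awk. This is an illustration under
stated assumptions, not a tested MDA filter: the function names are
made up, "html attachment" is taken to mean any text/html part that is
not marked Content-Disposition: attachment, and plain text anywhere in
the message (including inside a forwarded message) counts as "plain
text found".

```python
import sys
from email import message_from_bytes


def contains(msg, ctype):
    """True if msg or any nested part has the given MIME type."""
    return any(part.get_content_type() == ctype for part in msg.walk())


def is_inline_html(part):
    # An .html file sent as a real attachment is kept.
    return (part.get_content_type() == "text/html"
            and part.get_content_disposition() != "attachment")


def drop_html(msg):
    """Recursively remove inline text/html parts from multipart containers."""
    if not msg.is_multipart():
        return
    parts = msg.get_payload()
    # message/rfc822 also reports is_multipart(); only prune real multiparts.
    if msg.get_content_maintype() == "multipart":
        parts = [p for p in parts if not is_inline_html(p)]
        msg.set_payload(parts)
    for part in parts:
        drop_html(part)


def strip_html(raw):
    """Apply the four rules to one raw message (bytes in, bytes out)."""
    msg = message_from_bytes(raw)
    if contains(msg, "text/plain") and contains(msg, "text/html"):
        drop_html(msg)
        return msg.as_bytes()
    # No html, or html without plain text: return the input untouched.
    return raw


def main():
    # Filter mode: message on stdin, result on stdout.
    sys.stdout.buffer.write(strip_html(sys.stdin.buffer.read()))
```

To use it as a pipe filter, call main() under the usual
if __name__ == "__main__" guard. A multipart/alternative that loses its
html half keeps its multipart wrapper with a single text/plain child,
which is still valid MIME; unwrapping it is left out to keep the sketch
short.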