-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> The following regexp strips most of the Microsoft "XML" crap, e.g. <![if
> !supportEmptyParas]> :
>
> s/<\![^>]*>//g;
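As a quick illustration of what that substitution does, here is a minimal sketch. The sample string is hypothetical, written in the style of Word's HTML output rather than taken from a real export. Because the pattern stops at the first ">", it also removes doctype declarations and cuts a comment short if the comment itself contains a ">".

```perl
use strict;
use warnings;

# Hypothetical fragment in the style of Word's "Save as HTML" output
my $html = '<p><![if !supportEmptyParas]>&nbsp;<![endif]>Some text</p>'
         . '<!-- a comment with > inside -->';

# Copy the input, then strip everything from "<!" up to the next ">"
( my $clean = $html ) =~ s/<\![^>]*>//g;

print "$clean\n";
```

The conditional markers disappear cleanly, but the tail of the comment after its embedded ">" is left behind, which is one reason to prefer a real parser for anything beyond the simple cases.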
Very nice. I've modified your regex a bit and extended it; below is some
more code to play with, based on ideas from a few other people.
There's also Wp2Html[1], which is supposed to do quite a good job of
converting the MS-HTML (and WordPerfect) back to "normal" HTML. I haven't
tried it, so if someone could give it a go and let me know, I can add that
to the FAQ as well.
Some other tools to look at are HTML tidy[2], demoroniser[3], wv[4],
and WordFilter[5]. Each has its own niche. I prefer the Perl solution, of
course.
Another solution is to grab the actual text out of a Microsoft Word
document directly, using this small snippet:
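use strict; # of course!
use warnings;
use Win32::OLE; # will only install on Win32 systems
# 'Quit' is called on the Word instance when $word goes out of scope
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: ", Win32::OLE->LastError;
my $doc = $word->Documents->Open('C:\file.doc')
    or die "Cannot open document: ", Win32::OLE->LastError;
# The document body is a Range object; its Text property holds your data
my $text = $doc->Content->Text;
$doc->Close(0); # close without saving changes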
- ----
# Select the core attributes to ignore
my @ignore_attr = qw (bgcolor background color face style link alink vlink
text onblur onchange onclick ondblclick onfocus
onkeydown onkeyup onload onmousedown onmousemove
onmouseout onmouseover onmouseup onreset onselect
onunload class xmlns:w xmlns:o xmlns
);
# tags to ignore
my @ignore_tags = qw(font big small body dir html div span);
# tags to drop with content
my @ignore_elements = qw(script style head o:p);
use HTML::TreeBuilder;

# Uses the @ignore_attr, @ignore_tags and @ignore_elements lists declared above
sub un_mshtml {
    my $input = shift;
    my $warn  = 0;
    my $htmlex;    # set when the <html> wrapper should be stripped from the output

    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);    # keep tags the parser does not know in the tree
    $h->warn($warn);
    $h->parse($input);
    $h->eof;                  # tell the parser the input is complete

    # Drop all unwanted tags (their content is kept)
    foreach my $tag (@ignore_tags) {
        # <html> is the root element, so it is stripped from the output below
        if ( lc($tag) eq 'html' ) {
            $htmlex = 1;
            next;
        }
        while ( my $ok = $h->look_down( '_tag', $tag ) ) {
            $ok->replace_with_content;
        }
    }

    # Drop all unwanted elements (tags w/content)
    foreach my $tag (@ignore_elements) {
        while ( my $ok = $h->look_down( '_tag', $tag ) ) {
            $ok->detach;
        }
    }

    # Drop all unwanted attributes
    foreach my $attr (@ignore_attr) {
        while ( my $ok = $h->look_down( sub { defined( $_[0]->attr($attr) ) } ) ) {
            $ok->attr( $attr, undef );
        }
    }

    # Drop unwanted script code <![....]>
    # (this empties every element that has such a marker as a direct text child)
    foreach my $ok (
        $h->look_down(
            sub { grep { !ref && /^<\s*!\[.+?\]\s*>$/ } $_[0]->content_list }
        )
      )
    {
        $ok->detach_content;
    }

    my $output = $h->as_HTML( undef, " ", {} );
    # params = entities to encode, indent, optional endtags
    $h = $h->delete();

    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }
    return $output;
}
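
The three tree passes in un_mshtml (unwrap a tag, drop an element with its content, clear an attribute) can be tried in isolation. This is only a sketch: the sample markup is made up for illustration, and it uses delete where the routine above uses detach, so that the removed nodes are also freed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, not taken from a real Word export
my $sample = '<p class="MsoNormal"><font face="Arial">kept text</font>'
           . '<script>dropped()</script></p>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($sample);
$tree->eof;

# Pass 1: unwrap <font>, i.e. remove the tag but leave its children in place
while ( my $el = $tree->look_down( _tag => 'font' ) ) {
    $el->replace_with_content->delete;
}

# Pass 2: drop <script> together with everything inside it
while ( my $el = $tree->look_down( _tag => 'script' ) ) {
    $el->delete;
}

# Pass 3: clear the class attribute wherever it is set
for my $el ( $tree->look_down( sub { defined $_[0]->attr('class') } ) ) {
    $el->attr( 'class', undef );
}

my $html = $tree->as_HTML;
$tree->delete;
print "$html\n";
```

The text inside the font tag survives, while the script body and the class attribute do not. That is the same split un_mshtml makes between @ignore_tags, @ignore_elements and @ignore_attr.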
[1] http://www.res.bbsrc.ac.uk/wp2html/
[2] http://www.w3.org/People/Raggett/tidy/
[3] http://www.perl.com/language/misc/demoroniser
[4] http://www.wvware.com
[5] http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
d.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
iD8DBQE9233GkRQERnB1rkoRAnOpAJ0YBLLWfdDrCF+sqVwU2MJHbeh/LQCeIDdE
jhohbaeAERgf46wtZbP7jFI=
=M77X
-----END PGP SIGNATURE-----
_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev