The editing dept does a Save As... *.html on all the MS-Word files we
publish.  In the process, however, each line in the new HTML file ends up
terminated by a paragraph mark.  So I am trying to write a script that
deletes HTML tags that span newlines (which I got to work), and also tags
that span paragraph marks.

What I have so far is below.  The two substitutions marked with #*** are
examples of tags that span multiple lines and, in doing so, span the
paragraph marks.  Also, I know the script is not actually *doing* anything
yet; I am still in the testing phase, which is why all the COLOR constants
are specified...
____________________
#! /usr/bin/perl
use warnings;
use strict;

use Term::ANSIColor qw(:constants);
$Term::ANSIColor::AUTORESET = 1;

# $/ = "";  ### I tried it with this uncommented: the whole file becomes
#           ### one big "paragraph", and nothing matches.
while (<>) {
  # $. is the current input line number (the undeclared $i would not
  # compile under "use strict")

  # remove weird paragraph marks
  s/<\/?o:p>//msgi && print "$.: $`", ON_MAGENTA "|$&|", RESET "$'\n";

  # remove unnecessary closing tags
  s/<\/b>//msgi && print "$.: $`", YELLOW "|$&|", RESET "$'\n";
  s/<\/span>//msgi && print "$.: $`", ON_GREEN "|$&|", RESET "$'\n";

  # remove mso-spaceruns
  #***this is one tag that spans multiple lines
  s/<span\s*(\S+\s*\S+)\">/ /msgi
    && print "$.: $`", ON_RED "|$&|", RESET "$'\n";

  # remove mso image data (.*? is non-greedy, so two separate conditional
  # blocks are not swallowed as one match)
  #***this is one tag that spans multiple lines
  s/<!--\[if gte vml 1\]>.*?<!\[endif\]-->//msgi
    && print "$.: $`", GREEN "|$&|", RESET "$'\n";
  s/(v:shapes\S+\s)//msgi && print "$.: $`", ON_BLUE "|$&|", RESET "$'\n";
}
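
For comparison, here is a minimal sketch of the whole-file approach I am
aiming for, as opposed to the line-by-line loop.  It rests on an assumption
I have not confirmed: that the "paragraph marks" are carriage returns left
over from Windows line endings.  The name strip_word_markup and the sample
string are made up for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the whole document as one string, so a tag
# may span any number of line breaks.
sub strip_word_markup {
    my ($html) = @_;

    # ASSUMPTION: the "paragraph marks" are carriage returns (CRLF or bare CR).
    $html =~ s/\r\n?/\n/g;

    # Remove <o:p> and </o:p>.
    $html =~ s/<\/?o:p>//gi;

    # Remove VML conditional blocks; /s lets . match newlines and .*? keeps
    # each match from running past its own <![endif]-->.
    $html =~ s/<!--\[if gte vml 1\]>.*?<!\[endif\]-->//gis;

    return $html;
}

# Made-up sample input with a tag spanning several CRLF-terminated lines.
my $sample = "<p>Intro<o:p></o:p>\r\n"
           . "<!--[if gte vml 1]><v:shape\r\n id=\"x\">\r\n<![endif]-->"
           . "text</p>\r\n";

print strip_word_markup($sample);
```

To run this on a real file, the whole input could be read into a single
string with something like  my $html = do { local $/; <> };  in place of the
sample string.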



