i haven't read the latest messages yet, but here's an awk script
(with sed annexe), which deletes <P>, </P> & <BR> inside tables, as i did
by hand in www.chass.utoronto.ca/~purslow/trst.html
(there's no <BR> there, but other tests suggest it works);
it does assume < & > alternate, but otherwise seems robust.
to try it out, run the script on an HTML file of your choice;
disclaimer: awk & sed may perform differently on different systems.
i hear the jackals howling & see the hyenas slavering ...
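awk 'BEGIN { FS = "<" ; u=0 }    # split on "<"; u = depth of open TABLEs
$0 == "" {print $0}              # pass empty lines through
$0 != "" {for (i=1; i<=NF; i++) {
t=0;                             # t=1 once field i has been handled as a deleted tag
if ($i ~ /^[Tt][Aa][Bb][Ll][Ee].*>.*/) u=u+1;
if ($i ~ /^\/[Tt][Aa][Bb][Ll][Ee]/) {u=u-1; if (u<0) u=0};
# inside a table: drop P, /P & BR, keep the text after the ">";
# field 1 is plain text, never a tag, and the tag name must end
# right after P or BR, so PRE, PARAM etc are left alone
if (i>1 && u>0 && ($i ~ /^[Pp]([ \t>]|$)/ || $i ~ /^\/[Pp]([ \t>]|$)/ || $i ~ /^[Bb][Rr]([ \t\/>]|$)/ ))
{o=">"; split($i,a,o); printf "#####"; printf "%s", a[2]; t=1};
if (t==0) {
if (i>1) printf "<";             # put back the "<" eaten as field separator
printf "%s", $i }}; printf "\n" }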
' "$@" |
sed 's/#####//g
'
[NB the final ' ]
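
for anyone puzzling over how the script sees a line, here is a small sketch
(not part of the script itself; the function name show_fields & the sample
line are made up for illustration). with FS = "<", field 1 is whatever text
precedes the first <, and every later field is a tag body, a > & the text
that follows it, which is why split on ">" leaves that text in a[2]:

```shell
# show how awk splits one line on "<", then splits each tag field on ">"
# (show_fields and the sample input are hypothetical, for illustration only)
show_fields() {
  awk 'BEGIN { FS = "<" }
  { printf "1: text [%s]\n", $1          # text before the first "<"
    for (i = 2; i <= NF; i++) {
      split($i, a, ">")                  # a[1] = tag body, a[2] = text after it
      printf "%d: tag [%s] then text [%s]\n", i, a[1], a[2] } }'
}

printf '%s\n' 'intro<TD><P>cell</P></TD>' | show_fields
```

this should print five lines, one per field; the third reads
"3: tag [P] then text [cell]", ie the text the script keeps when it drops
the <P>.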
--
========================,,============================================
SUPPORT ___________//___, Philip Webb : [EMAIL PROTECTED]
ELECTRIC /] [] [] [] [] []| Centre for Urban & Community Studies
TRANSIT `-O----------O---' University of Toronto