Hi Harvey,
If you can post-process using regex, can't you also post-process using
Tidy? Sample code:
<?php
// Unquoted src attribute, as produced by the WYSIWYG editor
$bad_markup = '<img src=http://www.domain.com/a.jpg/>';
// Return only the repaired fragment, not a full HTML document
$tidy_config = array(
    'show-body-only' => true,
);
$tidy = new tidy();
$good_markup = $tidy->repairString($bad_markup, $tidy_config);
echo $good_markup; // outputs: <img src="http://www.domain.com/a.jpg/">
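
If pulling in Tidy feels like too much for this one problem, here is a rough single-pass regex sketch as an alternative. It is not taken from anyone's post in this thread: the function name quote_img_attributes is made up for illustration, it assumes PHP 7 or later, and it assumes no attribute value contains a literal > character. It skips values that are already quoted, so it should not need to be run more than once per tag.

```php
<?php
// Hypothetical helper (not from the thread): wrap unquoted attribute
// values inside <img> tags in double quotes, in a single pass.
function quote_img_attributes(string $html): string
{
    // Alternative 1: an already-quoted value (consumed and left alone).
    // Alternative 2: name=value where the value has no quotes.
    $attr = '/("[^"]*"|\'[^\']*\')|([a-z][a-z0-9-]*)\s*=\s*([^\s"\'>]+)/i';

    return preg_replace_callback(
        '/<img\b[^>]*>/i', // assumes no ">" inside attribute values
        function (array $tag) use ($attr): string {
            return preg_replace_callback(
                $attr,
                function (array $m): string {
                    if ($m[1] !== '') {
                        return $m[0]; // quoted value, keep as-is
                    }
                    return $m[2] . '="' . $m[3] . '"';
                },
                $tag[0]
            );
        },
        $html
    );
}

echo quote_img_attributes('<img src=filename.jpg width=100 height=100>');
// outputs: <img src="filename.jpg" width="100" height="100">
```

Treat it as a starting point only. Markup with unusual spacing or inline event handlers may need extra cases, which is where a real parser like Tidy starts to pay off.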
On Jun 29, 8:50 am, [email protected] wrote:
> Hi,
>
> There are a few reasons why I like the idea of regex in this instance.
> The main one being that I'm not actually looking to do a full clean-up
> of the HTML - just fix this one specific problem which is breaking other
> post-processing regexes that assume attributes are quoted. The offending
> HTML is being generated by a WYSIWYG, so any attempts to clean the HTML
> externally and reimport are going to be thwarted next time the user
> saves their page. Which is also why I prefer the post-processing regex
> approach in this case.
>
> But if I was trying to fix/validate a complete document, I definitely
> understand why a parser-based approach like Tidy is a better idea than
> regexing.
>
> Thanks,
>
> Harvey.
>
> On 28/06/2011 11:39 p.m., .Net2Php wrote:
>
> > Hi Harvey,
>
> > Regex is probably not the best thing to use to fix HTML. HTML Tidy
> > will probably be a better solution.
>
> > Looking at your regex, a few comments:
>
> > - Do you really need to use \s (which will match a space, tab,
> > carriage return, new line) or will a space suffice?
> > - The pattern in the capturing parentheses probably could be
> > simplified to something like: .*?
> > -- NOTE: you would wrap that pattern in capturing parentheses and put
> > a trailing space after the closing parenthesis
>
> > Hard to do regex here, but maybe something like this (untested):
>
> > src *= *(.*?)
>
> > NOTE: there is a trailing space in the regex. The replacement string
> > would be something like this (untested again):
>
> > "$1"
>
> > Hope this helps.
>
> > On Jun 28, 5:50 pm, [email protected] wrote:
> >> Thanks for the replies everyone. My mail is with Webdrive so I lost
> >> email shortly after posting this request, so I couldn't check replies or
> >> reply myself any sooner. I managed to find my own solution in the meantime.
>
> >> In this case, I only really cared about unquoted src attributes in img
> >> tags, so this is what I came up with.
>
> >> src\s*=\s*([/a-zA-Z0-9].*?)(>|( [a-z]+)=)
>
> >> Which needs to be run at least twice to clean all attributes in a tag.
>
> >> Thanks,
>
> >> Harvey.
>
> >> On 28/06/2011 10:24 a.m., Matthew Whyte wrote:
>
> >>> Hi Harvey,
> >>> I don't have a regex handy, but from memory, the last time I needed to
> >>> do something similar I used the "clean up HTML" option in Dreamweaver,
> >>> which did the trick. (I don't use Dreamweaver for anything else; I've
> >>> only got it because it came as part of the Adobe Suite!)
> >>> Cheers,
> >>> Matthew Whyte
> >>> Managing Director | digiCreative
> >>> T: +64 7 959 8230
> >>> F: +64 7 974 9059
> >>> E: [email protected]
> >>> W: digicreative.co.nz
> >>> 5 King St | PO Box 19492, Hamilton, New Zealand
> >>> On Tue, Jun 28, 2011 at 10:17 AM, <[email protected]> wrote:
> >>> Hi All,
> >>> I need to fix up some sloppy HTML which is (in some cases) missing
> >>> quotes around the HTML attributes.
> >>> e.g. <img src=filename.jpg width=100 height=100>
> >>> Does anyone have a tested regex sitting in their collection for
> >>> adding back in those missing quotes?
> >>> Thanks,
> >>> Harvey.
> >>> --
> >>> Harvey Kane
> >>> Phone:
> >>> - Auckland: +64 9 950 4133
> >>> - Wanaka: +64 3 746 8133
> >>> - Mobile: +64 21 811 951
> >>> Email: [email protected]
> >>> If you need to contact me urgently, please read my email policy
> >>> www.ragepank.com/email/
> >>> --
> >>> NZ PHP Users Group: http://groups.google.com/group/nzphpug
> >>> To post, send email to [email protected]
> >>> To unsubscribe, send email to
> >>> [email protected]