Hi Harvey,

If you can post-process using regex, can't you also post-process using
Tidy? Sample code:

<?php
        // Unquoted attribute value, like the markup the WYSIWYG produces.
        $bad_markup  = '<img src=http://www.domain.com/a.jpg/>';
        $tidy_config = array(
                // Return only the repaired fragment, not a full HTML document.
                'show-body-only' => TRUE
        );
        $tidy        = new tidy();
        $good_markup = $tidy->repairString($bad_markup, $tidy_config);

        echo ($good_markup); // outputs: <img src="http://www.domain.com/a.jpg/">
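
If you do want to stay with a regex for this one fix, here is a rough
single-pass sketch that should avoid having to run the pattern twice. The
function name quote_unquoted_attributes is made up for illustration, and the
sketch assumes attribute values never contain < or >, so treat it as a
starting point rather than a tested solution:

```php
<?php
// Hypothetical helper, not from the thread: quotes unquoted attribute
// values in every tag in one pass.
// Assumption: no '<' or '>' appears inside an attribute value.
function quote_unquoted_attributes($html)
{
    // Outer pattern: each opening tag, e.g. <img src=a.jpg width=100>.
    return preg_replace_callback('/<[a-zA-Z][^<>]*>/', function ($tag) {
        // Inner pattern: name=value, where value is "quoted", 'quoted' or bare.
        // Matching the quoted forms as well stops us rewriting text inside them.
        return preg_replace_callback(
            '/(\s+[^\s=<>"\'\/]+)\s*=\s*("[^"]*"|\'[^\']*\'|[^\s"\'>]+)/',
            function ($attr) {
                $value = $attr[2];
                if ($value[0] === '"' || $value[0] === "'") {
                    return $attr[0]; // already quoted: leave untouched
                }
                return $attr[1] . '="' . $value . '"';
            },
            $tag[0]
        );
    }, $html);
}

echo quote_unquoted_attributes('<img src=filename.jpg width=100 height=100>');
// outputs: <img src="filename.jpg" width="100" height="100">
```

Because the inner pattern walks the attributes left to right, each one is
handled in the same call. A trailing slash directly after a bare value (as in
a.jpg/>) ends up inside the quotes.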


On Jun 29, 8:50 am, [email protected] wrote:
> Hi,
>
> There are a few reasons why I like the idea of regex in this instance.
> The main one being that I'm not actually looking to do a full clean-up
> of the HTML - just fix this one specific problem which is breaking other
> post-processing regexes that assume attributes are quoted. The offending
> HTML is being generated by a WYSIWYG, so any attempts to clean the HTML
> externally and reimport are going to be thwarted next time the user
> saves their page. Which is also why I prefer the post-processing regex
> approach in this case.
>
> But if I was trying to fix/validate a complete document, I definitely
> understand why a parser-based approach like Tidy is a better idea than
> regexing.
>
> Thanks,
>
> Harvey.
>
> On 28/06/2011 11:39 p.m., .Net2Php wrote:
>
> > Hi Harvey,
>
> > Regex is probably not the best thing to use to fix HTML. HTML Tidy
> > will probably be a better solution.
>
> > Looking at your regex, a few comments:
>
> > - Do you really need to use \s (which will match a space, tab,
> > carriage return, new line) or will a space suffice?
> > - The pattern in the capturing parentheses probably could be
> > simplified to something like: .*?
> > -- NOTE: you would wrap that pattern in capturing parentheses and put
> > a trailing space after the closing parenthesis
>
> > Hard to do regex here, but maybe something like this (untested):
>
> > src *= *(.*?)
>
> > NOTE: there is a trailing space in the regex. The replacement string
> > would be something like this (untested again):
>
> > "$1"
>
> > Hope this helps.
>
> > On Jun 28, 5:50 pm, [email protected] wrote:
> >> Thanks for the replies everyone. My mail is with Webdrive so I lost
> >> email shortly after posting this request, so I couldn't check replies or
> >> reply myself any sooner. I managed to find my own solution in the meantime.
>
> >> In this case, I only really cared about missing quotes around src
> >> attributes in img tags, so this is what I came up with.
>
> >> src\s*=\s*([/a-zA-Z0-9].*?)(>|( [a-z]+)=)
>
> >> Which needs to be run at least twice to clean all attributes in a tag.
>
> >> Thanks,
>
> >> Harvey.
>
> >> On 28/06/2011 10:24 a.m., Matthew Whyte wrote:
>
> >>> Hi Harvey,
> >>> I don't have a regex handy, but from memory the last time I needed to
> >>> do something similar I used the "clean up HTML" option in Dreamweaver,
> >>> which did the trick. (I don't use Dreamweaver for anything else; I've
> >>> only got it because it came as part of the Adobe Suite!)
> >>> Cheers,
> >>> Matthew Whyte
> >>> Managing Director | digiCreative
> >>> T: +64 7 959 8230
> >>> F: +64 7 974 9059
> >>> E: [email protected]<mailto:[email protected]>
> >>> W: digicreative.co.nz<http://digicreative.co.nz/>
> >>> digiCreative
> >>> 5 King St | PO Box 19492, Hamilton, New Zealand
> >>> ------------------------------------------------------------------------
> >>> The content of this email is confidential and may be legally
> >>> privileged.  If it is not intended for you, please email the sender
> >>> immediately and destroy the original message.
> >>> On Tue, Jun 28, 2011 at 10:17 AM,<[email protected]
> >>> <mailto:[email protected]>>  wrote:
> >>>      Hi All,
> >>>      I need to fix up some sloppy HTML which is (in some cases) missing
> >>>      quotes around the HTML attributes.
> >>>      e.g. <img src=filename.jpg width=100 height=100>
> >>>      Does anyone have a tested regex sitting in their collection for
> >>>      adding back in those missing quotes?
> >>>      Thanks,
> >>>      Harvey.
> >>>      --
> >>>      Harvey Kane
> >>>      Phone:
> >>>      - Auckland: +64 9 950 4133
> >>>      - Wanaka: +64 3 746 8133
> >>>      - Mobile: +64 21 811 951
> >>>      Email: [email protected]<mailto:[email protected]>
> >>>       If you need to contact me urgently, please read my email policy
> >>>    www.ragepank.com/email/<http://www.ragepank.com/email/>
> >>>      --
> >>>      NZ PHP Users Group: http://groups.google.com/group/nzphpug
> >>>      To post, send email to [email protected]
> >>>      <mailto:[email protected]>
> >>>      To unsubscribe, send email to
> >>>      [email protected]
> >>>      <mailto:nzphpug%[email protected]>

