Pardon the messy code, but I had this working like a charm. Then I
went to try it on some Russian content and it broke. The inbound
string was UTF-8-encoded Russian text; the output was something else.

I found a PHP bug report from years ago that sounded related, but the
reporter had a workaround.

Note that none of my own functions appear to break the encoding — it
is ->saveHTML() that doesn't seem to work (I also tried saveXML(),
and it did not work either).
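For what it's worth, here is a sketch of one commonly cited workaround (I am assuming this matches your symptom — it is not taken from your post): loadHTML() assumes ISO-8859-1 when the markup does not declare an encoding, so UTF-8 bytes are mangled at parse time, before saveHTML() is ever called. Prepending an XML encoding hint tells libxml the real encoding up front:

```php
<?php
// Sketch: declare the encoding before parsing. Without a declaration,
// DOMDocument::loadHTML() treats the bytes as ISO-8859-1, so UTF-8 input
// is already mangled before saveHTML() runs.

$html = '<p>Привет, мир</p>'; // sample UTF-8 Russian input

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

$out = $dom->saveHTML();
// libxml may emit non-ASCII as numeric entities; decode before comparing.
$decoded = html_entity_decode($out, ENT_QUOTES, 'UTF-8');
```

Prepending a `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` tag before the markup achieves the same thing — either way, the point is that the encoding must be declared before the parse, not after.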

I am totally open to swapping out PHP's DOM for another library.
Basically, I just want to traverse the DOM and pick out all <a href>
and <img src> attributes (and possibly any other external references
in the documents) so I can run them through some link examination and
such. I figured I might have to fall back to a regexp, but PHP's DOM
handled even partial and malformed HTML so well that I was excited at
how easy this was.

        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        $dom->loadHTML($html); // $html is the inbound UTF-8 string
        $links = $dom->getElementsByTagName('a');
        foreach ($links as $tag) {
                $before = $tag->getAttribute('href');
                $after = strip_chars($before);
                $after = map_url($after);
                $after = fix_link($after);
                if ($after != false) {
                        echo "\tBEFORE: $before\n";
                        echo "\tAFTER : $after\n\n";
                        $tag->setAttribute('href', $after);
                }
        }
        return $dom->saveHTML();
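Since the larger goal is to pick out every external reference, a DOMXPath query can collect both <a href> and <img src> in one pass. This is only a sketch with hypothetical names (collect_references is mine, not from the code above):

```php
<?php
// Sketch: gather href/src values from possibly malformed HTML in one
// XPath pass. Names here are illustrative, not from the original code.

function collect_references(string $html): array {
    $dom = new DOMDocument();
    // @ silences the parser warnings that partial/malformed HTML triggers.
    @$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    $refs = [];
    $xpath = new DOMXPath($dom);
    // The union expression returns matching nodes in document order.
    foreach ($xpath->query('//a[@href] | //img[@src]') as $node) {
        $attr = ($node->nodeName === 'a') ? 'href' : 'src';
        $refs[] = $node->getAttribute($attr);
    }
    return $refs;
}
```

The same union expression can be extended with further alternatives (e.g. `//script[@src]`, `//link[@href]`) if you want every external reference, not just links and images.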

I tried things like this:

new DomDocument('1.0', 'UTF-8');

as well as encoding options on $dom, like $dom->encoding = 'utf-8' or
similar (I tried so many variations I cannot remember them all anymore).

Anyone have any ideas?

As long as it can read in the string (which is, and should always be,
UTF-8) and spit out UTF-8, I can make sure that any of my functions
that handle the data are UTF-8 safe...
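One cheap guard for that (my suggestion, not something from the thread): after each transformation, verify the result is still well-formed UTF-8 before passing it along.

```php
<?php
// Sketch: mb_check_encoding() returns true only for byte strings that are
// valid in the given encoding, so it catches a broken round trip early.

$html = '<p>ссылка</p>'; // sample UTF-8 input

$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$out = $dom->saveHTML();

$ok = mb_check_encoding($out, 'UTF-8'); // true when output is valid UTF-8
```

This requires the mbstring extension, but since the rest of the pipeline should already be using mb_* functions for UTF-8 safety, that is unlikely to be a new dependency.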


PHP General Mailing List