Pardon the messy code, but I had this working like a charm. Then I
went to try it on some Russian content and it broke. The inbound
string was UTF-8-encoded Russian text; the output was something else.

I found a PHP bug report from years ago that sounded related, but the
reporter had a workaround.

Note that none of my own functions appear to break the encoding — it
is ->saveHTML() that doesn't seem to work (I also tried saveXML(),
and it did not work either).
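For what it's worth, here is a sketch of one commonly cited workaround (I am assuming this matches your symptom — it is not taken from your post): loadHTML() assumes ISO-8859-1 when the markup does not declare an encoding, so UTF-8 bytes are mangled at parse time, before saveHTML() is ever called. Prepending an XML encoding hint tells libxml the real encoding up front:

```php
<?php
// Sketch: declare the encoding before parsing. Without a declaration,
// DOMDocument::loadHTML() treats the bytes as ISO-8859-1, so UTF-8 input
// is already mangled before saveHTML() runs.

$html = '<p>Привет, мир</p>'; // sample UTF-8 Russian input

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

$out = $dom->saveHTML();
// libxml may emit non-ASCII as numeric entities; decode before comparing.
$decoded = html_entity_decode($out, ENT_QUOTES, 'UTF-8');
```

Prepending a `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">` tag before the markup achieves the same thing — either way, the point is that the encoding must be declared before the parse, not after.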

I am totally open to swapping out PHP's DOM for another library.
Basically, I just want to traverse the DOM and pick out all <a href>
and <img src> attributes (and possibly any other external references
in the documents) so I can run them through some link examination and
such. I figured I might have to fall back to a regexp, but PHP's DOM
handled even partial and malformed HTML so well that I was excited at
how easy this was.

        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        $dom->loadHTML($html); // $html is the inbound UTF-8 string
        $links = $dom->getElementsByTagName('a');
        foreach ($links as $tag) {
                $before = $tag->getAttribute('href');
                $after = strip_chars($before);
                $after = map_url($after);
                $after = fix_link($after);
                if ($after != false) {
                        echo "\tBEFORE: $before\n";
                        echo "\tAFTER : $after\n\n";
                        $tag->setAttribute('href', $after);
                }
        }
        return $dom->saveHTML();
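Since the larger goal is to pick out every external reference, a DOMXPath query can collect both <a href> and <img src> in one pass. This is only a sketch with hypothetical names (collect_references is mine, not from the code above):

```php
<?php
// Sketch: gather href/src values from possibly malformed HTML in one
// XPath pass. Names here are illustrative, not from the original code.

function collect_references(string $html): array {
    $dom = new DOMDocument();
    // @ silences the parser warnings that partial/malformed HTML triggers.
    @$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    $refs = [];
    $xpath = new DOMXPath($dom);
    // The union expression returns matching nodes in document order.
    foreach ($xpath->query('//a[@href] | //img[@src]') as $node) {
        $attr = ($node->nodeName === 'a') ? 'href' : 'src';
        $refs[] = $node->getAttribute($attr);
    }
    return $refs;
}
```

The same union expression can be extended with further alternatives (e.g. `//script[@src]`, `//link[@href]`) if you want every external reference, not just links and images.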

I tried things like this:

new DomDocument('1.0', 'UTF-8');

as well as encoding options on $dom, like $dom->encoding = 'utf-8' or
similar (I tried so many variations I cannot remember them all anymore).

Anyone have any ideas?

As long as it can read in the string (which is, and should always be,
UTF-8) and spit out UTF-8, I can make sure that any of my functions
that handle the data are UTF-8 safe...
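One cheap guard for that (my suggestion, not something from the thread): after each transformation, verify the result is still well-formed UTF-8 before passing it along.

```php
<?php
// Sketch: mb_check_encoding() returns true only for byte strings that are
// valid in the given encoding, so it catches a broken round trip early.

$html = '<p>ссылка</p>'; // sample UTF-8 input

$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
$out = $dom->saveHTML();

$ok = mb_check_encoding($out, 'UTF-8'); // true when output is valid UTF-8
```

This requires the mbstring extension, but since the rest of the pipeline should already be using mb_* functions for UTF-8 safety, that is unlikely to be a new dependency.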


PHP General Mailing List