I tried that kind of thing, and it did not seem to work.

I will try again. If anyone has any ideas, e.g. "use iconv to convert
to A, then use the DOM functions, then use iconv to convert back to
UTF-8", I am all ears.

On Tue, Feb 17, 2009 at 12:46 PM, Nathan Nobbe <quickshif...@gmail.com> wrote:
> On Tue, Feb 17, 2009 at 12:40 PM, mike <mike...@gmail.com> wrote:
>> Pardon the messy code, but I got this working like a charm. Then I
>> went to try it on some Russian content and it broke. The inbound was
>> utf-8 encoded Russian characters, output was something else
>> unintelligible.
>> I found a PHP bug from years ago that sounded related but the user had
>> a workaround.
>> Note that it does not appear that any of the functions break the
>> encoding - it is the ->saveHTML() that doesn't seem to work (I also
>> tried saveXML() and it did not work either).
>> I am totally up for changing out using php's DOM and using another
>> library, basically I just want to traverse the DOM and pick out all <a
>> href> and <img src> and possibly any other external references in the
>> documents so I can run them through some link examination and such. I
>> figured I may have to fall back to a regexp, but PHP's DOM was so good
>> with even partial and malformed HTML, I was excited at how easy this
>> was...
>>        $dom = new domDocument;
>>        @$dom->loadHTML($string);
>>        $dom->preserveWhiteSpace = false;
>>        $links = $dom->getElementsByTagName('a');
>>        foreach($links as $tag) {
>>                $before = $tag->getAttribute('href');
>>                $after = strip_chars($before);
>>                $after = map_url($after);
>>                $after = fix_link($after);
>>                if($after != false) {
>>                        echo "\tBEFORE: $before\n";
>>                        echo "\tAFTER : $after\n\n";
>>                        $tag->removeAttribute('href');
>>                        $tag->setAttribute('href', $after);
>>                }
>>        }
>>        return $dom->saveHTML();
>> }
>> I tried things like this:
>> new DomDocument('1.0', 'UTF-8');
>> as well as encoding options for $dom like $dom->encoding = 'utf-8' or
>> something (I tried so many variations I cannot remember anymore)
>> Anyone have any ideas?
>> As long as it can read in the string (which is and should always be
>> UTF-8) and spit out UTF-8, I can make sure any of my functions are
>> UTF-8 safe that handle the data...
> from the manual on DOM,
> Note: DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode()
> to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
> -nathan
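
For what it is worth, here is a rough sketch of one workaround that is often
suggested for this symptom. It rests on an assumption that is not confirmed
anywhere in this thread: that loadHTML() treats the input as ISO-8859-1 when
the markup carries no charset declaration, so the UTF-8 bytes are already
misread before saveHTML() is called. The function name rewrite_links_utf8()
and the $rewrite callback are hypothetical; the callback stands in for the
strip_chars() / map_url() / fix_link() chain from the original post.

```php
<?php
// Hypothetical helper: rewrite every <a href> in a UTF-8 HTML string.
// $rewrite receives the old href and returns the new one, or false to skip.
function rewrite_links_utf8(string $html, callable $rewrite): string
{
    $dom = new DOMDocument();

    // Assumption: a leading charset declaration makes the parser read the
    // input as UTF-8 instead of falling back to ISO-8859-1.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // Collect parser warnings about malformed HTML instead of using @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($hint . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $tag) {
        $before = $tag->getAttribute('href');
        $after  = $rewrite($before);
        if ($after !== false) {
            // setAttribute() overwrites the existing value, so no
            // removeAttribute() call is needed first.
            $tag->setAttribute('href', $after);
        }
    }

    // Serializes the whole document, including the injected <meta> tag.
    return $dom->saveHTML();
}
```

The result is a full document (doctype, html, head, body), so a caller that
only wants the original fragment back would still have to strip the wrapper.
Depending on the libxml version, non-ASCII text may also come back as numeric
entities rather than raw UTF-8 bytes; html_entity_decode() with 'UTF-8' as
the charset should cover that case.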

PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
