Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”

monty Thu, 28 Jul 2016 14:45:45 -0700

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM 
printToFileNamed: family of messages when writing) and let XMLParser take care 
of this for you, or disable XMLParser decoding before parsing with 
#decodesCharacters:.
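
A minimal sketch of both options, reusing the file name from the quoted 
message. The output file name is made up for illustration, and sending 
#fullName to get a path string is an assumption about how you would hand the 
path over; adjust to your image.

```smalltalk
"Sketch only: assumes illegal-UTF-sms.xml exists in the home directory."
| messageLog doc |
messageLog := FileLocator home / 'illegal-UTF-sms.xml'.

"Option 1: pass the path as a string, so XMLParser opens the file
 and does the (single) decoding itself."
doc := XMLDOMParser parseFileNamed: messageLog fullName.

"Option 2: keep the file stream, which already decodes,
 and switch off XMLParser's own decoding before parsing."
doc := (XMLDOMParser on: messageLog readStream)
	decodesCharacters: false;
	parseDocument.

"Writing back out through the DOM; 'sms-copy.xml' is a hypothetical name."
doc printToFileNamed: 'sms-copy.xml'.
```

Either option should leave exactly one decoding step between the bytes on disk 
and the parser.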


Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the 
definitions). You gave it a FileReference, but because the argument is tested 
with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But 
XMLParser automatically attempts its own decoding too, if:

 The input starts with a BOM, or its encoding can be inferred from null bytes 
before or after the first non-null byte.

 There is an encoding declaration with a non-UTF-8 encoding.

 There is a UTF-8 encoding declaration but the stream is not a normal 
ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. 
I'll consider changing the heuristic to make it less eager to decode.
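
To see why a second pass fails on U+00A0 specifically, here is a rough 
illustration. It uses Zinc's ZnUTF8Encoder (which ships with Pharo) as a 
stand-in, not XMLParser's own converter, so the exact error class and message 
will differ from what you saw.

```smalltalk
| encoder nbsp utf8Bytes decodedOnce |
encoder := ZnUTF8Encoder new.
nbsp := String with: (Character value: 16rA0).

"On disk, U+00A0 is the two bytes 16rC2 16rA0."
utf8Bytes := encoder encodeString: nbsp.

"First decoding (what the file stream does): back to the one character U+00A0."
decodedOnce := encoder decodeBytes: utf8Bytes.

"Second decoding (what the parser then attempts): the character's code
 point 16rA0 is treated as a raw byte. A lone 16rA0 is a continuation
 byte with no lead byte, so it is not valid UTF-8 and the decoder is
 expected to signal an error, which is caught here."
[ encoder decodeBytes: decodedOnce asByteArray ]
	on: Error
	do: [ :error | error messageText ]
```

Plain ASCII survives double decoding unchanged, which is why the problem only 
shows up once the file contains a character above U+007F.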

> Sent: Thursday, July 28, 2016 at 4:05 PM
> From: "Sean P. DeNigris" <[email protected]>
> To: [email protected]
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”
>
> monty-3 wrote
> > Just to be sure, I manually recreated your file (with the great Bless hex
> > editor) and parsed it with no issue.
> 
> Thanks!
> 
> 
> monty-3 wrote
> > Please post your code and attach the actual source as a file separately.
> 
> The code is merely:
>   messageLog := FileLocator home / 'illegal-UTF-sms.xml'. 
>   doc := XMLDOMParser parse: messageLog.
> 
> File:  illegal-UTF-sms.xml
> <http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>  
> 
> 
> 
> -----
> Cheers,
> Sean
> --
> View this message in context: 
> http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908531.html
> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
> 
>
