MaGGE wrote:
Hello again Ken,

Sorry to lag behind on the replies - work is busy these days... :)

Seems you're right. I've made a custom ResponseHandler class to be able to
dump the raw output from HttpClient. However, I'd used
FileWriter/BufferedWriter to dump to my file. This must've tried to
interpret charset also, causing the bothersome 0x3F's mentioned before.
Your tip about another HttpClient app returning the content successfully
caused me to look at my method again - and I made the output via
FileOutputStream.write(byte[]) instead. Using hexdump as before I can now
confirm that there's no longer a 0x3F but 0xC3 0xA5 as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

(...from my dump)
# hexdump -s 0x1845 -C raw.txt | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
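
The difference between the two dump methods can be reproduced without HttpClient. The sketch below is only an illustration, not the ResponseHandler from this thread: the class and method names are made up, in-memory streams stand in for the files, and US-ASCII stands in for a default charset that cannot represent the letter. How the original handler actually turned the response into characters is not shown in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class RawDumpSketch {

    // Character path: the bytes are decoded to a String and re-encoded on write.
    // US-ASCII is an assumption standing in for a charset without the letter.
    static byte[] dumpViaWriter(byte[] body) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        String text = new String(body, StandardCharsets.UTF_8);
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.US_ASCII)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Byte path: the bytes are copied as-is, no charset is involved.
    static byte[] dumpViaStream(byte[] body) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(body, 0, body.length);
        return sink.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex, similar to hexdump -C.
    static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] body = {0x70, (byte) 0xC3, (byte) 0xA5}; // 'p' followed by UTF-8 for U+00E5
        System.out.println(hex(dumpViaWriter(body)));   // 70 3f
        System.out.println(hex(dumpViaStream(body)));   // 70 c3 a5
    }
}
```

The writer path replaces the unmappable character with 0x3F ('?'), while the stream path leaves 0xC3 0xA5 intact.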

What remains a mystery to me is, however, why the string returned from
HttpClient.execute() and thus EntityUtils.toString(Entity) does not
represent the letters correctly. I also tested this with the
BasicResponseHandler to be sure it was nothing I'd done.

At least now I can use my custom ResponseHandler and figure out how to treat
the intact byte array correctly. So thanks a lot! :)



If you had only listened and produced a wire / context log when I asked you to, all this could have been found out much earlier, and I most likely would also have been able to tell why EntityUtils#toString failed to detect the charset.

Oleg




Ken Krugler wrote:
Hi Magnus,

I used curl to grab the file, and the bytes at 0x1845...0x1847 are 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (Latin small letter a with ring above).
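
The byte values can be checked independently of any HTTP tool. This is a small JDK-only sketch with a hypothetical class name that encodes a code point as UTF-8 and prints the bytes in hex.

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeSketch {

    // Returns the UTF-8 encoding of one code point as space-separated hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(0x00E5)); // Unicode name of the code point
        System.out.println(utf8Hex(0x00E5));           // c3 a5
    }
}
```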

I also used Bixo (http://bixo.101tec.com) to crawl the same page, and wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a good test.

Given what you've tried (in your initial email), I've only got one weak guess - that your tools are showing you stuff that isn't actually there.



