Hello again Ken,

Sorry to lag behind on the replies. Work is busy these days... :)

Seems you're right. I had written a custom ResponseHandler class to dump the raw output from HttpClient, but I used FileWriter/BufferedWriter to write the dump file. The writer must have applied a charset conversion of its own, causing the bothersome 0x3F's mentioned before. Your tip about another HttpClient app returning the content successfully made me look at my method again, and I now write the output via FileOutputStream.write(byte[]) instead. Using hexdump as before, I can confirm that there is no longer a 0x3F but 0xC3 0xA5, as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

(...from my dump)
# hexdump -s 0x1845 -C raw.txt | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

What remains a mystery to me, however, is why the string returned from HttpClient.execute(), and thus from EntityUtils.toString(Entity), does not represent the letters correctly. I also tested this with the BasicResponseHandler to be sure it was nothing I had done.

At least now I can use my custom ResponseHandler and figure out how to treat the intact byte array correctly.

So thanks a lot! :)

Ken Krugler wrote:
>
> Hi Magnus,
>
> I used curl to grab the file, and the bytes at 0x1845...0x1847 are
> 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (latin small
> letter a with ring above).
>
> I also used Bixo (http://bixo.101tec.com) to crawl the same page, and
> wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a
> good test.
>
> Given what you've tried (in your initial email), I've only got one
> weak guess - that your tools are showing you stuff that isn't actually
> there.
>
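
To illustrate the FileWriter point from my reply above, here is a minimal sketch of how a Writer can change the bytes while a plain byte copy cannot. It is not the actual ResponseHandler from this thread: the class name CharsetDumpDemo and its methods are made up for the example, and the charsets are assumptions, since I do not know which default charset was in effect on my machine. US-ASCII stands in for a writer charset that cannot encode the letter, and ISO-8859-1 stands in for a decoder that guesses the wrong charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class CharsetDumpDemo {

    // Copy the bytes untouched, the way FileOutputStream.write(byte[]) does.
    static byte[] dumpRaw(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Decode the bytes to a String, then push the String through a Writer.
    // Both steps apply a charset, so the output can differ from the input.
    static byte[] dumpViaWriter(byte[] body, Charset decodeAs, Charset writeAs) {
        String text = new String(body, decodeAs);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, writeAs)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Format bytes as space-separated lowercase hex, similar to hexdump.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'p' followed by U+00E5 encoded as UTF-8 (0xC3 0xA5).
        byte[] body = {0x70, (byte) 0xc3, (byte) 0xa5};

        // Raw copy: bytes are preserved.
        System.out.println(hex(dumpRaw(body)));
        // Correct decode, but the writer charset cannot encode the letter.
        System.out.println(hex(dumpViaWriter(body,
                StandardCharsets.UTF_8, StandardCharsets.US_ASCII)));
        // Wrong decode: each UTF-8 byte is treated as its own character.
        System.out.println(hex(dumpViaWriter(body,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
    }
}
```

Running main prints "70 c3 a5" for the raw copy, "70 3f" for the ASCII writer, and "70 c3 83 c2 a5" for the wrong decode. The second line matches the 0x3F I saw in my first dump. The third may be related to the garbled string from EntityUtils.toString(Entity), if the response is decoded with a charset other than UTF-8, but that part is only a guess on my side.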
--
View this message in context: http://www.nabble.com/Charset-trouble%2C-questionmarks-tp25253439p25307019.html
Sent from the HttpClient-User mailing list archive at Nabble.com.
