On 26 May 2009, at 07:40, Sven Hartrumpf wrote:

> Mon, 25 May 2009 17:18:19 +0100, richard wrote:
>>>>> <http://dbpedia.org/resource/%C4%8C%C3%A1raj%C3%A1vri> ...
>>>>> How can %C4%8C be decoded? Obviously it's not Unicode.
>>>> That is URL encoding.
>>>
>>> I should have spent some more details here: If I url-decode the above,
>>> I don't know what the result should be. UTF-8?
>>
>> Yes. The byte sequence that you get after decoding the %-encoding is
>> to be turned into a character sequence by using UTF-8.
>
>> echo resource/%C4%8C%C3%A1raj%C3%A1vri | urldecode
> resource/Äárajávri
>> echo resource/Äárajávri | unihist
> Invalid UTF-8 code encountered at line 0, character 9, byte 9.
> The sequence is not a valid UTF-8 character because
> the first byte, value 0xC4, bit pattern 11000100,
> requires 1 continuation bytes, but of the immediately
> following bytes, byte 1, value 0xC3, bit pattern
> 11000100 is not a valid continuation byte, since
> its high bits are not 10.

Use proper tools. The continuation byte after 0xC4 is 0x8C, not 0xC3. This is plainly obvious from looking at the original %-encoded string.

0xC4 0x8C in binary is 11000100 10001100. The payload bits are ***00100 **001100 (see [1] for a handy table), which in hex is 0x10C, which according to [2] is LATIN CAPITAL LETTER C WITH CARON: "Č".

The entire string is "Čárajávri", which I figured out simply by copy-pasting the original URI into my browser's URL bar and hitting ENTER.

In general, don't pass Unicode characters through the shell. This will just mess things up. Store them in a file, open it in your web browser, and try different options from the "View -> Character Encoding" menu to understand what's going on.

Best,
Richard

[1] http://en.wikipedia.org/wiki/UTF-8#Description
[2] http://www.unicode.org/charts/PDF/U0100.pdf
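
For anyone who wants to reproduce the decoding without involving the shell or a browser, here is a minimal sketch in Python. It is only an illustration of the two steps described above (undo the %-encoding to get bytes, then interpret the bytes as UTF-8). The helper names `decode_percent_utf8` and `two_byte_code_point` are made up for this example; `urllib.parse.unquote_to_bytes` and `unicodedata.name` are standard library functions. The output is printed as code points rather than raw characters, so a misconfigured terminal cannot mangle it.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def decode_percent_utf8(encoded: str) -> str:
    """Undo the %-encoding to get raw bytes, then decode those bytes as UTF-8."""
    raw = unquote_to_bytes(encoded)
    return raw.decode("utf-8")


def two_byte_code_point(lead: int, cont: int) -> int:
    """Combine the payload bits of a two-byte UTF-8 sequence (110xxxxx 10xxxxxx)."""
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # 5 payload bits from the lead byte, 6 from the continuation byte
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


if __name__ == "__main__":
    # The bit arithmetic from the message, done by hand
    cp = two_byte_code_point(0xC4, 0x8C)
    print(hex(cp))                    # 0x10c
    print(unicodedata.name(chr(cp)))  # LATIN CAPITAL LETTER C WITH CARON

    # The whole local name from the URI, shown as code points
    text = decode_percent_utf8("%C4%8C%C3%A1raj%C3%A1vri")
    print([f"U+{ord(c):04X}" for c in text])
```

The first code point printed for the full string is U+010C, which matches the manual calculation.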

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
