On 26 May 2009, at 07:40, Sven Hartrumpf wrote:

> Mon, 25 May 2009 17:18:19 +0100, richard wrote:
>>>>> <http://dbpedia.org/resource/%C4%8C%C3%A1raj%C3%A1vri> ...
>>>>> How can %C4%8C be decoded? Obviously it's not Unicode.
>>>> That is URL encoding.
>>>
>>> I should have given more details here: if I url-decode the above,
>>> I don't know what the result should be. UTF-8?
>>
>> Yes. The byte sequence that you get after decoding the %-encoding is
>> to be turned into a character sequence by using UTF-8.
>
>> echo resource/%C4%8C%C3%A1raj%C3%A1vri | urldecode
> resource/Äárajávri
>> echo resource/Äárajávri | unihist
> Invalid UTF-8 code encountered at line 0, character 9, byte 9.
> The sequence is not a valid UTF-8 character because
> the first byte, value 0xC4, bit pattern 11000100,
> requires 1 continuation bytes, but of the immediately
> following bytes, byte 1, value 0xC3, bit pattern
> 11000100 is not a valid continuation byte, since
> its high bits are not 10.

Use proper tools. The continuation byte after 0xC4 is 0x8C, not 0xC3.  
This is plainly obvious from looking at the original %-encoded string.

0xC4 0x8C in binary is 11000100 10001100. The payload bits are
***00100 **001100 (see [1] for a handy table), which in hex is 0x10C,
i.e. code point U+010C, which according to [2] is LATIN CAPITAL LETTER
C WITH CARON: "Č". The entire string is "Čárajávri", which I figured
out simply by copy-pasting the original URI into my browser's URL bar
and hitting ENTER.
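
The same two steps (undo the %-encoding to get bytes, then read the
bytes as UTF-8) can also be checked without a browser. What follows is
only a minimal sketch in Python 3, not something from the original
exchange; the variable names are my own, and the bit masks simply
restate the payload-bit table in [1] for a two-byte sequence.

```python
from urllib.parse import unquote_to_bytes

uri_path = "resource/%C4%8C%C3%A1raj%C3%A1vri"

# Step 1: undo the %-encoding. The result is a byte sequence, not text.
raw = unquote_to_bytes(uri_path)

# Step 2: interpret that byte sequence as UTF-8 to get characters.
text = raw.decode("utf-8")

# Manual check of the first two-byte sequence, 0xC4 0x8C:
# a lead byte 110xxxxx carries 5 payload bits,
# a continuation byte 10xxxxxx carries 6 payload bits.
start = len("resource/")
lead, cont = raw[start], raw[start + 1]
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(text)
print(hex(code_point), chr(code_point))
```

Here code_point should come out as 0x10c, matching the hand
calculation above.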

In general, don't pass Unicode characters through the shell. A
terminal or tool that assumes a different encoding will just mess
things up. Store them in a file, open it in your web browser, and try
different options from the "View -> Character Encoding" menu to
understand what's going on.
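
As a rough stand-in for flipping through the Character Encoding menu,
the sketch below (Python 3 again, my own illustration) decodes the same
bytes under three encodings. The choice of cp1252 and latin-1 as the
"wrong" candidates is an assumption; I don't know which encoding the
terminal in the quoted session actually used.

```python
# The UTF-8 bytes of the name, as obtained after %-decoding.
raw = bytes.fromhex("c48cc3a172616ac3a1767269")

# Decode the identical bytes under several candidate encodings.
decoded = {}
for encoding in ("utf-8", "cp1252", "latin-1"):
    decoded[encoding] = raw.decode(encoding)
    # repr() makes invisible control characters such as \x8c visible.
    print(f"{encoding:8} {decoded[encoding]!r}")
```

Only the utf-8 row gives the intended name; the other two show the
typical two-characters-per-letter garbage you get when UTF-8 bytes are
read as a single-byte encoding.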

Best,
Richard

[1] http://en.wikipedia.org/wiki/UTF-8#Description
[2] http://www.unicode.org/charts/PDF/U0100.pdf



> ------------------------------------------------------------------------------
> Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT
> is a gathering of tech-side developers & brand creativity  
> professionals. Meet
> the minds behind Google Creative Lab, Visual Complexity, Processing, &
> iPhoneDevCamp asthey present alongside digital heavyweights like  
> Barbarian
> Group, R/GA, & Big Spaceship. http://www.creativitycat.com
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion


