On Mon, Jan 05, 2015 at 09:52:12PM +0000, Steve Simon wrote:
> I am trying to parse a stream from a tcp connection.
> 
> I think the data is utf8, here is a sample
> 
>        20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
> 
> which when I print it I get:
> 
>      -       e  s  k       r  o  z  h  l  a  s           
>            ^          ^
>         missing    missing
> 
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
> 
>       unicode c8 fd
>       È
>       ý
> 
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map 
> onto the same letters as utf8?
> 
> thanks in advance of useful suggestions ☺
> 
> -Steve
> 
> 

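Those might be ISO-8859-1 octets.
The first 256 codepoints of Unicode are those of ISO-8859-1, which is why
unicode(1) shows È and ý for c8 and fd.
They are not valid UTF-8: c8 would start a two-octet sequence, but the next
octet (65) is not a continuation octet, and fd never occurs in UTF-8 at all.
Since the text looks Czech, ISO-8859-2 is also worth trying: there c8 is Č
and fd is still ý, which would give "Český rozhlas".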
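
A minimal Python sketch of the point above (Python is my choice here, not
something from the thread, and the names are only illustrative): the sample
octets fail strict UTF-8 decoding, while decoding them as ISO-8859-1 gives
codepoints whose values equal the octets themselves.

```python
# The 16 octets from the original message.
data = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as strict UTF-8 (illustrative helper)."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 maps every octet to the codepoint with the same numeric value.
as_latin1 = data.decode("iso-8859-1")
same_values = [ord(c) for c in as_latin1] == list(data)

print(is_valid_utf8(data))  # False
print(same_values)          # True
print(as_latin1)
```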

% unicode -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | xd
- right away produces 18 octets of UTF-8 representing the 16 Unicode
codepoints listed as arguments.

% ascii -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | tcs -f 8859-1 | xd
- yields 16 octets, which tcs then treats as 16 codepoints encoded in
ISO-8859-1 and converts to 18 octets of UTF-8 (representing those same 16
codepoints).
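
For comparison, here is a rough Python equivalent of that conversion. It is
only a sketch of the same idea (read the octets as ISO-8859-1, write them out
as UTF-8), not what tcs does internally, and the variable names are my own.

```python
# The 16 octets from the original message.
data = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")

# Interpret each octet as one ISO-8859-1 codepoint, then re-encode as UTF-8.
text = data.decode("iso-8859-1")
utf8 = text.encode("utf-8")

# 16 octets in, 16 codepoints, 18 octets out: c8 and fd each need two
# octets in UTF-8, everything else is plain ASCII and stays one octet.
print(len(data), len(text), len(utf8))  # 16 16 18
print(utf8.hex(" "))
```

The two extra octets come from c8 becoming c3 88 and fd becoming c3 bd.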

Hope this helps,

-- 
Antons
