On Mon, Jan 05, 2015 at 09:52:12PM +0000, Steve Simon wrote:
> I am trying to parse a stream from a tcp connection.
>
> I think the data is utf8, here is a sample
>
> 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>
> which when I print it I get:
>
>  -   e s k   r o z h l a s
>    ^       ^
> missing missing
>
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
>
> unicode c8 fd
> È
> ý
>
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map
> onto the same letters as utf8?
>
> thanks in advance of useful suggestions ☺
>
> -Steve

Those might be ISO-8859-1 octets. The first 256 codepoints of Unicode
are those of ISO-8859-1.

	% unicode -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | xd -

right away produces 18 octets representing 16 Unicode codepoints
(the ones listed as args).

	% ascii -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | tcs -f 8859-1 | xd -

yields 16 octets, which tcs then treats as encoding 16 codepoints in
ISO-8859-1 and transforms to 18 octets of UTF-8 (representing those
same 16 codepoints).

Hope this helps,
--
Antons
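
For anyone without tcs at hand, the same check can be sketched in
Python. This is only an illustration of the reasoning above, not what
the Plan 9 tools do internally. The ISO-8859-2 line is my own guess,
not something from the thread: the text looks like Czech, and in that
code page c8 maps to Č rather than È.

```python
# Sample octets from the original message.
raw = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")

# Strict UTF-8 decoding fails: c8 is a UTF-8 lead byte, but the byte
# after it (65) is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Treat every octet as one ISO-8859-1 codepoint, then re-encode as UTF-8.
latin1 = raw.decode("iso-8859-1")
recoded = latin1.encode("utf-8")

# Assumption (not from the thread): the sender may have used ISO-8859-2.
latin2 = raw.decode("iso-8859-2")

print("valid UTF-8:", utf8_ok)
print("as ISO-8859-1:", repr(latin1))
print("codepoints:", len(latin1), "UTF-8 octets:", len(recoded))
print("as ISO-8859-2:", repr(latin2))
```

The 16 input octets become 16 codepoints and 18 UTF-8 octets, because
the two octets above 0x7f each need two octets in UTF-8.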