Re: [go-nuts] understanding utf-8 for a newbie

Sam Whited Sat, 06 May 2017 16:54:12 -0700

On Fri, May 5, 2017 at 8:11 PM, rob solomon <drrob...@verizon.net> wrote:
> I decided to first change ", ' and emdash characters.  Using hexdump -C in
> Ubuntu, the runes in the file are:
>
> open quote = 0xE2809C
>
> close quote = 0xE2809D
>
> apostrophe = 0xE28099
>
> emdash = 0xE28094


hexdump shows the actual bytes stored in the file. For a UTF-8 file,
those bytes are the UTF-8 encoding of each character, not the
character's Unicode number.

> However, when I write a simple program to display these runes from the file,
> using the routines in unicode/utf8, I get very different values.  I do not
> understand this.
>
> open quote = 0x201C
>
> close quote = 0x201D
>
> apostrophe = 0x2019
>
> emdash = 0x2014.

These are called Unicode code points. Unicode assigns a number to each
character (letters, digits, emoji, and so on); Go's type for storing a
code point is called "rune". An encoding such as UTF-8 then turns those
numbers into the bytes that you saw in the hexdump output. The code
point of an em dash is always U+2014 (code points are conventionally
written in hex with a "U+" prefix), but the bytes that represent it
depend on the encoding, which can vary with the system or file format:
E2 80 94 in UTF-8, for example, but 20 14 in big-endian UTF-16.
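
To make the relationship concrete, here is a minimal sketch that goes
from the hexdump bytes back to the code points. It assumes the file is
valid UTF-8; decodeFirst is a helper name made up for this example, and
the only library call it relies on is utf8.DecodeRune from unicode/utf8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst is a hypothetical helper for this example. It returns the
// first code point encoded in b and the number of bytes it occupies.
func decodeFirst(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

func main() {
	// The byte sequences reported by hexdump in the original question.
	samples := []struct {
		name string
		raw  []byte
	}{
		{"open quote", []byte{0xE2, 0x80, 0x9C}},
		{"close quote", []byte{0xE2, 0x80, 0x9D}},
		{"apostrophe", []byte{0xE2, 0x80, 0x99}},
		{"em dash", []byte{0xE2, 0x80, 0x94}},
	}

	for _, s := range samples {
		r, size := decodeFirst(s.raw)
		// % X prints the raw bytes in hex; %U prints the code point as U+XXXX.
		fmt.Printf("%-11s bytes=% X -> %U (%d bytes)\n", s.name, s.raw, r, size)
	}
}
```

Each three-byte sequence decodes to the single code point that the
unicode/utf8 routines reported, so both sets of numbers describe the
same characters.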

Here is an example of encoding a rune with the value 0x2014 as UTF-8,
which gives the bytes you observed in your hexdump output:
https://play.golang.org/p/ddIfzobKD4
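
In case the playground link is unavailable, the following is an
independent sketch of the same idea, not necessarily the code behind
the link. The names encode and encode3 are made up for this example.
encode uses utf8.EncodeRune from the standard library; encode3 does the
bit shuffling by hand and assumes the code point lies in the range
U+0800 to U+FFFF, which is the range that UTF-8 stores in three bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes for r using the standard library.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// encode3 builds the three-byte UTF-8 form by hand, for illustration only.
// Assumption: r is in U+0800..U+FFFF and is not a surrogate.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),          // 1110xxxx: top 4 bits
		0x80 | (byte(r>>6) & 0x3F),  // 10xxxxxx: middle 6 bits
		0x80 | (byte(r) & 0x3F),     // 10xxxxxx: low 6 bits
	}
}

func main() {
	const emDash = rune(0x2014)

	fmt.Printf("%U -> % X\n", emDash, encode(emDash)) // prints "U+2014 -> E2 80 94"

	manual := encode3(emDash)
	fmt.Printf("%U -> % X (by hand)\n", emDash, manual[:])
}
```

The 16 bits of 0x2014 are split into groups of 4, 6 and 6 bits, and each
group is placed behind a fixed marker (1110, 10, 10). That is why the
bytes E2 80 94 look nothing like 20 14 even though they carry the same
number.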

—Sam

