Re: UTF-8 character encoding
On 6/26/18, Michael Enright wrote:
> On Mon, Jun 25, 2018 at 11:33 AM, Lee wrote:
>> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
>> 0xff is part of the utf-8 encoding.
>
> I don't see how you arrived at this.

I screwed up trying to do hex in my head. For whatever reason I didn't
want to write 0 - 127

> An initial byte of 0xFF is not
> the initial byte of any valid UTF-8 byte sequence. And it doesn't
> conform with the statement you have later:

right, I screwed up :)

> The standards such as IETF RFC-3629 are easy enough to read, so I
> recommend using them and citing them to others instead of trying to
> summarize.

Thanks for the RFC reference - I hadn't come across that one yet.

Lee

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Re: UTF-8 character encoding
On 6/26/18, Thomas Wolff wrote:
> This encoding scheme is wrong; where did you get it from? Maybe it's the
> obsolete UTF-8...

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I thought I saw something about utf-8 being able to handle a 31 bit
value.. is that also obsolete/wrong?

how about this for the current encoding scheme:
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

Table 3-6. UTF-8 Bit Distribution

Bits  Scalar Value                First Byte  Second Byte  Third Byte  Fourth Byte
  7   00000000 0xxxxxxx           0xxxxxxx
 11   00000yyy yyxxxxxx           110yyyyy    10xxxxxx
 16   zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy     10xxxxxx
 21   000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx

Lee
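For anyone who wants to see the bit distribution from Table 3-6 in action, here is a minimal sketch in Python. Python is my choice, not something used in the thread, and the function name `encode_utf8` is hypothetical. It packs one scalar value into 1 to 4 bytes by following the table rows, and rejects surrogates and values above U+10FFFF as the current standard requires. In real code the built-in `str.encode("utf-8")` does the same job.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value per Table 3-6 (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")

    if code_point < 0x80:
        # 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Calling `encode_utf8(0xC4).hex()` for U+00C4 (Ä) should give the two-byte form from the second table row, and the result can be compared against `chr(0xC4).encode("utf-8")` as a sanity check.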
Re: UTF-8 character encoding
On Mon, Jun 25, 2018 at 11:33 AM, Lee wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. An initial byte of 0xFF is not
the initial byte of any valid UTF-8 byte sequence. And it doesn't
conform with the statement you have later:

> An easy way to remember this transformation format is to note that the
> number of high-order 1's in the first byte is the same as the number of
> subsequent bytes in the multibyte character:

This is true, but there is also a zero bit that ends the high-order-1's
bit string, which means that 0xFF is not a valid lead byte. 0x7F is the
highest byte value that you can have as a single-byte UTF-8 string.
Perhaps your statement about 0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of
characters above U+10FFFF, which isn't legal UTF-8 but was in the
original proposal. Otherwise your table rows 1-4 are correct.

The standards such as IETF RFC-3629 are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.
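The point about lead bytes can be sketched in a few lines of Python (my choice of language; `sequence_length` is a hypothetical name). The sketch counts the high-order 1 bits of a byte and applies the RFC 3629 rule that the octets C0, C1 and F5 through FF never appear in UTF-8. One caveat: in current UTF-8 the count of leading 1 bits equals the total length of the sequence, whereas the quoted sentence describes the older proposal, where it equals the number of subsequent bytes.

```python
from typing import Optional


def sequence_length(lead: int) -> Optional[int]:
    """Return the total sequence length a lead byte announces, or None
    if the byte cannot start a valid UTF-8 sequence (illustrative helper)."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")

    # Count the run of 1 bits starting at the most significant bit.
    ones = 0
    mask = 0x80
    while mask and lead & mask:
        ones += 1
        mask >>= 1

    if ones == 0:
        return 1       # 0xxxxxxx: single-byte (ASCII) character
    if ones == 1:
        return None    # 10xxxxxx: continuation byte, never a lead byte
    if ones > 4:
        return None    # 0xF8..0xFF: not used by UTF-8
    if lead in (0xC0, 0xC1) or lead > 0xF4:
        return None    # overlong forms, or beyond U+10FFFF
    return ones        # 2, 3 or 4 bytes in total
```

With this, `sequence_length(0x7F)` is 1 and `sequence_length(0xFF)` is None, which matches the two claims above.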
Re: UTF-8 character encoding
On 25.06.2018 at 20:33, Lee wrote:
> On 6/24/18, L A Walsh wrote:
>> Lee wrote:
>>> So... keep it simple, set
>>>   LANG=en_US.UTF-8
>>> and use vi or something else that comes with cygwin to create the file
>>> and I'll have a file with UTF-8 character encoding - correct?
>> ---
>> The first 128 characters of UTF-8 are identical to the
>> first 128 characters of ASCII, and latin1 and iso-8859-1.
>>
>> If you don't use any characters that need accents or special symbols,
>> then nothing will be encoded differently in UTF-8, because it's only
>> the characters OVER the first 128 that get multibyte encodings
>> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
>
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.
> This chart makes things clearer ... at least for me :)
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>
>   The proposed UCS transformation format encodes UCS values in the range
>   [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
>   bytes. For all encodings of more than one byte, the initial byte
>   determines the number of bytes used and the high-order bit in each byte
>   is set.
>
>   An easy way to remember this transformation format is to note that the
>   number of high-order 1's in the first byte is the same as the number of
>   subsequent bytes in the multibyte character:
>
>      Bits  Hex Min   Hex Max   Byte Sequence in Binary
>   1    7   00000000  0000007f  0zzzzzzz
>   2   13   00000080  0000207f  10zzzzzz 1yyyyyyy
>   3   19   00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
>   4   25   00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
>   5   31   02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv

This encoding scheme is wrong; where did you get it from? Maybe it's the
obsolete UTF-8...
Re: UTF-8 character encoding
On 6/24/18, L A Walsh wrote:
> Lee wrote:
>> So... keep it simple, set
>>   LANG=en_US.UTF-8
>> and use vi or something else that comes with cygwin to create the file
>> and I'll have a file with UTF-8 character encoding - correct?
> ---
> The first 128 characters of UTF-8 are identical to the
> first 128 characters of ASCII, and latin1 and iso-8859-1.
>
> If you don't use any characters that need accents or special symbols,
> then nothing will be encoded differently in UTF-8, because it's only
> the characters OVER the first 128 that get multibyte encodings
> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).

I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
0xff is part of the utf-8 encoding.
This chart makes things clearer ... at least for me :)

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

  The proposed UCS transformation format encodes UCS values in the range
  [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
  bytes. For all encodings of more than one byte, the initial byte
  determines the number of bytes used and the high-order bit in each byte
  is set.

  An easy way to remember this transformation format is to note that the
  number of high-order 1's in the first byte is the same as the number of
  subsequent bytes in the multibyte character:

     Bits  Hex Min   Hex Max   Byte Sequence in Binary
  1    7   00000000  0000007f  0zzzzzzz
  2   13   00000080  0000207f  10zzzzzz 1yyyyyyy
  3   19   00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
  4   25   00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
  5   31   02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv

Thanks
Lee
Re: UTF-8 character encoding
Lee wrote:
> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?
---
The first 128 characters of UTF-8 are identical to the
first 128 characters of ASCII, and latin1 and iso-8859-1.

If you don't use any characters that need accents or special symbols,
then nothing will be encoded differently in UTF-8, because it's only
the characters OVER the first 128 that get multibyte encodings
(see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).

The site also has a software utility
(http://www.babelstone.co.uk/Software/BabelMap.html) that displays
all the characters in Unicode and helps configure fonts to show them,
though it hasn't been updated for the changes that came out last month
or so (Unicode 11).

It's a cool little, *free*, utility...though if you find it useful
you can always send in your registration.
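A quick way to see where the encodings agree and where they part ways, sketched in Python (my choice of language; the helper name is hypothetical). It compares the bytes a string produces under latin-1 and under UTF-8. The two match only while every character is in the 0 to 127 range.

```python
def same_bytes_in_latin1_and_utf8(text: str) -> bool:
    """True if text encodes to identical bytes in latin-1 and UTF-8
    (illustrative helper; false if latin-1 cannot encode the text at all)."""
    try:
        return text.encode("latin-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # Characters above U+00FF have no latin-1 representation.
        return False


# Every code point 0..127 is one identical byte in both encodings.
all_ascii = "".join(chr(i) for i in range(128))
print(same_bytes_in_latin1_and_utf8(all_ascii))

# An accented letter is one byte in latin-1 but two bytes in UTF-8.
print(len("\u00c4".encode("latin-1")), len("\u00c4".encode("utf-8")))
```

So a file holding only characters from that range is at once valid ASCII, valid latin-1 and valid UTF-8; the encodings only become distinguishable once a character above 127 shows up.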
Re: UTF-8 character encoding
Greetings, Lee!

> On 6/20/18, Andrey Repin wrote:
>> Greetings, Lee!
>>
>>> I'm looking at
>>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>>> and it starts off with
>>>   Use UTF-8 character encoding.
>>
>>> How do I do that and how do I check that I actually did use UTF-8
>>> character encoding _without_ using file?
>>
>> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> I think I don't know enough to ask the right question. A quick search
> yesterday on byte order markers turned up
>
>   https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
> with this bit
>   Note Microsoft uses UTF-16, little endian byte order.

Yes, the default multibyte Windows encoding is UTF-16LE.
But in general, this is application specific.

> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?

I'm not familiar with vi, but this is true for the other *NIX editors I
know: they use the current locale settings by default, unless something
else is specified in their configuration or prompted by other cues (like
a byte order mark).

IMO, your best bet is to use an editor that explicitly supports saving
texts in the desired encoding.

And please no BOM for UTF-8 files.

--
With best regards,
Andrey Repin
Friday, June 22, 2018 14:13:14

Sorry for my terrible english...
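To check for the byte order mark mentioned above, something like the following Python sketch could be used (Python is my choice; the function names are hypothetical). It relies on the standard-library constant `codecs.BOM_UTF8`, which is the three-byte sequence EF BB BF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, unchanged if there is none."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# The same text in the two encodings discussed in this message:
sample = "Hi"
print(sample.encode("utf-8"))      # one byte per ASCII character
print(sample.encode("utf-16-le"))  # two bytes per character, low byte first
```

To test a real file, read it in binary mode, for example `has_utf8_bom(open(path, "rb").read(3))`, where `path` stands for whatever file you want to inspect.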
Re: UTF-8 character encoding
On 6/20/18, Andrey Repin wrote:
> Greetings, Lee!
>
>> I'm looking at
>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>> and it starts off with
>>   Use UTF-8 character encoding.
>
>> How do I do that and how do I check that I actually did use UTF-8
>> character encoding _without_ using file?
>
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

I think I don't know enough to ask the right question. A quick search
yesterday on byte order markers turned up

  https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
with this bit
  Note Microsoft uses UTF-16, little endian byte order.

So... keep it simple, set
  LANG=en_US.UTF-8
and use vi or something else that comes with cygwin to create the file
and I'll have a file with UTF-8 character encoding - correct?

Thanks,
Lee
Re: UTF-8 character encoding
On Thu, 21 Jun 2018 12:12:39, Houder wrote:
> On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> > I'm looking at
> >   https://cygwin.com/packaging-hint-files.html#pvr.hint
> > and it starts off with
> >   Use UTF-8 character encoding.
> >
> > How do I do that and how do I check that I actually did use UTF-8
> > character encoding _without_ using file?
> [snip]
>
> > I used vi to create both files & I'd like to understand why file says
> > one is ascii & the other is utf-8
>
> vim can tell you that in the statusline ...
>
>     :help statusline
>     :help encoding
>
> Ask Google to help you with the details: GS: "vim show encoding in status".
>
> E.g.
>
>  - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
>    (Show fileencoding and bomb in the status line)
>
> As an example:
>
>     set laststatus=2
>     "set statusline=...
>     set statusline+=\ en:\ %{strlen()\ ?\ \ :\ 'x'}
>     "set statusline+...

Also read:

 - https://unix.stackexchange.com/questions/23389/how-can-i-set-vims-default-encoding-to-utf-8
   (How can I set VIM's default encoding to UTF-8?)

for a "quickstart" on the subject of character encoding/vim.

Henri
Re: UTF-8 character encoding
On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
>
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
[snip]

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8

vim can tell you that in the statusline ...

    :help statusline
    :help encoding

Ask Google to help you with the details: GS: "vim show encoding in status".

E.g.

 - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
   (Show fileencoding and bomb in the status line)

As an example:

    set laststatus=2
    "set statusline=...
    set statusline+=\ en:\ %{strlen()\ ?\ \ :\ 'x'}
    "set statusline+...

Henri
Re: UTF-8 character encoding
Greetings, Lee!

> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.

> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text

> $ file test.c
> test.c: C source, ASCII text

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8

--
With best regards,
Andrey Repin
Thursday, June 21, 2018 4:25:27

Sorry for my terrible english...
Re: UTF-8 character encoding
On 20.06.2018 at 20:09, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
>
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
>
> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text
>
> $ file test.c
> test.c: C source, ASCII text
>
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
>
> Thanks,
> Lee

ASCII is a subset of UTF-8, so that's fine. The file command will report
ASCII as long as your text does not contain any non-ASCII characters.
If you add some (for example ÄÖÜ), it should report UTF-8.

Regards,
Stefan
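As for checking the encoding without the file command, one option is to let a strict decoder try the bytes. Below is a minimal sketch in Python (my choice of language; `classify` and `classify_file` are hypothetical names). It reports ASCII when every byte is below 0x80, UTF-8 when the bytes decode cleanly as UTF-8, and otherwise says the data is not UTF-8. It makes no attempt to guess any other encoding, and it does not reproduce the heuristics that file itself uses.

```python
def classify(data: bytes) -> str:
    """Label raw bytes as 'ASCII', 'UTF-8' or 'not UTF-8' (illustrative helper)."""
    try:
        data.decode("ascii")
        return "ASCII"          # pure ASCII is also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")    # strict decoding rejects malformed sequences
        return "UTF-8"
    except UnicodeDecodeError:
        return "not UTF-8"


def classify_file(path: str) -> str:
    """Read a file in binary mode and classify its contents."""
    with open(path, "rb") as handle:
        return classify(handle.read())
```

For instance, `classify("ÄÖÜ".encode("utf-8"))` gives "UTF-8", while the same three letters saved as latin-1 come back as "not UTF-8". A source file with no accented characters comes back as "ASCII", which is why test.c was reported that way.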