Re: UTF-8 character encoding

2018-06-27 Thread Lee
On 6/26/18, Michael Enright  wrote:
> On Mon, Jun 25, 2018 at 11:33 AM, Lee  wrote:
>> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
>> 0xff is part of the utf-8 encoding.
>
> I don't see how you arrived at this.

I screwed up trying to do hex in my head.  For whatever reason I
didn't want to write 0 - 127

> An initial byte of 0xFF is not
> the initial byte of any valid UTF-8 byte sequence. And it doesn't
> conform with the statement you have later:

right, I screwed up :)

> The standards such as IETF RFC-3629 are easy enough to read, so I
> recommend using them and citing them to others instead of trying to
> summarize.

Thanks for the RFC reference - I hadn't come across that one yet.

Lee

--
Problem reports:   http://cygwin.com/problems.html
FAQ:   http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple



Re: UTF-8 character encoding

2018-06-27 Thread Lee
On 6/26/18, Thomas Wolff  wrote:

> This encoding scheme is wrong; where did you get it from? Maybe it's the
> obsolete UTF-8...

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I thought I saw something about utf-8 being able to handle a 31 bit
value..  is that also obsolete/wrong?

how about this for the current encoding scheme:
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

Table 3-6.  UTF-8 Bit Distribution
Bits  Scalar Value                First Byte  Second Byte  Third Byte  Fourth Byte
   7  00000000 0xxxxxxx           0xxxxxxx
  11  00000yyy yyxxxxxx           110yyyyy    10xxxxxx
  16  zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy     10xxxxxx
  21  000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
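
For illustration, the bit distribution in Table 3-6 can be sketched in
Python roughly as follows. The function name utf8_encode is made up for
this example, and the range checks (U+10FFFF upper limit, no surrogates)
are taken from RFC 3629 rather than from the table itself; in practice
Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value to UTF-8 bytes (hypothetical helper)."""
    # Assumption: reject values outside the scalar range, as RFC 3629 requires.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp <= 0x7F:        # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

A quick sanity check is to compare the result with the built-in encoder,
e.g. utf8_encode(0x20AC) against "\u20ac".encode("utf-8").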

Lee




Re: UTF-8 character encoding

2018-06-26 Thread Michael Enright
On Mon, Jun 25, 2018 at 11:33 AM, Lee  wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. An initial byte of 0xFF is not
the initial byte of any valid UTF-8 byte sequence. And it doesn't
conform with the statement you have later:

>  An easy way to remember this transformation format is to note that the
>  number of high-order 1's in the first byte is the same as the number of
>  subsequent bytes in the multibyte character:

This is true, but there is also a zero bit that ends the
high-order-1's bit string, which means that 0xFF is not a valid lead
byte. 0x7F is the highest byte value that you can have as a
single-byte UTF8 string.
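
As a rough Python sketch of the lead-byte rule: in UTF-8 as defined by
RFC 3629, the run of high-order 1 bits in the first byte of a multibyte
sequence is terminated by a 0 bit and gives the total sequence length.
The function name utf8_sequence_length is hypothetical, and the valid
lead-byte ranges (00-7F, C2-DF, E0-EF, F0-F4) are taken from the RFC.

```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a lead byte, or 0 if the
    byte cannot start a valid UTF-8 sequence (hypothetical helper)."""
    if lead <= 0x7F:            # 0xxxxxxx: single byte, same as ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx: 2 bytes (C0/C1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx: 3 bytes
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx: 4 bytes, capped at U+10FFFF
        return 4
    return 0                    # 80-BF (continuation), C0, C1, F5-FF

assert utf8_sequence_length(0x7F) == 1   # highest single-byte value
assert utf8_sequence_length(0xFF) == 0   # never a valid lead byte
```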

Perhaps your statement about 0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of
characters above U+10FFFF, which aren't legal UTF-8 but were in the
original proposal. Otherwise your table rows 1-4 are correct.

The standards such as IETF RFC-3629 are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.




Re: UTF-8 character encoding

2018-06-26 Thread Thomas Wolff

On 25.06.2018 at 20:33, Lee wrote:
> On 6/24/18, L A Walsh  wrote:
>> Lee wrote:
>>> So... keep it simple, set
>>>    LANG=en_US.UTF-8
>>> and use vi or something else that comes with cygwin to create the file
>>> and I'll have a file with UTF-8 character encoding - correct?
>> ---
>> The first 128 characters of UTF-8 are identical to the
>> first 128 characters of ASCII and of latin1 (iso-8859-1).
>>
>> If you don't use any characters that need accents or special symbols,
>> then the file is byte-for-byte plain ASCII, because it's only
>> the characters OVER the first 128 that get multibyte encodings
>> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
>
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.  This chart makes things clearer
> ... at least for me :)
>  http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>   The proposed UCS transformation format encodes UCS values in the range
>   [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
>   bytes.  For all encodings of more than one byte, the initial byte
>   determines the number of bytes used and the high-order bit in each byte
>   is set.
>
>   An easy way to remember this transformation format is to note that the
>   number of high-order 1's in the first byte is the same as the number of
>   subsequent bytes in the multibyte character:
>
>      Bits  Hex Min   Hex Max   Byte Sequence in Binary
>   1     7  00000000  0000007f  0zzzzzzz
>   2    13  00000080  0000207f  10zzzzzz 1yyyyyyy
>   3    19  00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
>   4    25  00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
>   5    31  02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv

This encoding scheme is wrong; where did you get it from? Maybe it's the
obsolete UTF-8...





Re: UTF-8 character encoding

2018-06-25 Thread Lee
On 6/24/18, L A Walsh  wrote:
> Lee wrote:
>> So... keep it simple, set
>>   LANG=en_US.UTF-8
>> and use vi or something else that comes with cygwin to create the file
>> and I'll have a file with UTF-8 character encoding - correct?
> ---
>   The first 128 characters of UTF-8 are identical to the
> first 128 characters of ASCII and of latin1 (iso-8859-1).
>
> If you don't use any characters that need accents or special symbols,
> then the file is byte-for-byte plain ASCII, because it's only
> the characters OVER the first 128 that get multibyte encodings
> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).

I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
0xff is part of the utf-8 encoding.  This chart makes things clearer
... at least for me :)
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
 The proposed UCS transformation format encodes UCS values in the range
 [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
 bytes.  For all encodings of more than one byte, the initial byte
 determines the number of bytes used and the high-order bit in each byte
 is set.

 An easy way to remember this transformation format is to note that the
 number of high-order 1's in the first byte is the same as the number of
 subsequent bytes in the multibyte character:

    Bits  Hex Min   Hex Max   Byte Sequence in Binary
 1     7  00000000  0000007f  0zzzzzzz
 2    13  00000080  0000207f  10zzzzzz 1yyyyyyy
 3    19  00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
 4    25  00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
 5    31  02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv

Thanks
Lee




Re: UTF-8 character encoding

2018-06-24 Thread L A Walsh

Lee wrote:
> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?

---
The first 128 characters of UTF-8 are identical to the
first 128 characters of ASCII and of latin1 (iso-8859-1).

If you don't use any characters that need accents or special symbols,
then the file is byte-for-byte plain ASCII, because it's only
the characters OVER the first 128 that get multibyte encodings
(see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).

The site also has a software utility
(http://www.babelstone.co.uk/Software/BabelMap.html) that displays all the
characters in Unicode and helps configure fonts to display them, though it
hasn't been updated for the changes that came out last month or so
(Unicode 11).

It's a cool little, *free*, utility...though if you find it useful
you can always send in your registration.
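
On the point above about the low code points being shared between
ASCII, latin1 and UTF-8, a small Python sketch makes the difference
visible. The helper name encodings and the sample strings are made up
for this example; only the standard str.encode method is used.

```python
def encodings(text):
    """Return the UTF-8 and ISO-8859-1 byte strings for text (hypothetical helper)."""
    return text.encode("utf-8"), text.encode("iso-8859-1")

# Pure ASCII text: both encodings produce exactly the same bytes.
plain_utf8, plain_latin1 = encodings("sdesc: hello")

# One accented character (U+00E9): one byte in latin1, two bytes in UTF-8.
accent_utf8, accent_latin1 = encodings("\u00e9")

print(plain_utf8 == plain_latin1)   # True
print(accent_latin1)                # b'\xe9'
print(accent_utf8)                  # b'\xc3\xa9'
```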





Re: UTF-8 character encoding

2018-06-22 Thread Andrey Repin
Greetings, Lee!

> On 6/20/18, Andrey Repin wrote:
>> Greetings, Lee!
>>
>>> I'm looking at
>>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>>> and it starts off with
>>>   Use UTF-8 character encoding.
>>
>>> How do I do that and how do I check that I actually did use UTF-8
>>> character encoding _without_ using file?
>>
>> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> I think I don't know enough to ask the right question.  A quick search
> yesterday on byte order markers turned up
>   https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
> with this bit
>   Note   Microsoft uses UTF-16, little endian byte order.

Yes, default multibyte Windows encoding is UTF-16LE.
But in general, this is application specific.

> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?

I'm not familiar with vi, but this is true for other *NIX editors I know, they
use current locale settings by default, unless something else is specified in
their configuration or prompted by other cases (like byte order mark).

IMO, best chance is to use an editor that explicitly supports saving texts in
the desired encoding.
And please no BOM for UTF-8 files.
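
A minimal Python sketch for checking whether a file's contents start
with a UTF-8 byte order mark, and for dropping it. The function names
are hypothetical; codecs.BOM_UTF8 is the standard-library constant for
the three bytes EF BB BF.

```python
import codecs

def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if there is one."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

To check a real file, read it in binary mode first, for example
has_utf8_bom(open("setup.hint", "rb").read()), where setup.hint is only
a placeholder file name.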


-- 
With best regards,
Andrey Repin
Friday, June 22, 2018 14:13:14

Sorry for my terrible english...





Re: UTF-8 character encoding

2018-06-21 Thread Lee
On 6/20/18, Andrey Repin wrote:
> Greetings, Lee!
>
>> I'm looking at
>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>> and it starts off with
>>   Use UTF-8 character encoding.
>
>> How do I do that and how do I check that I actually did use UTF-8
>> character encoding _without_ using file?
>
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

I think I don't know enough to ask the right question.  A quick search
yesterday on byte order markers turned up
  https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
with this bit
  Note   Microsoft uses UTF-16, little endian byte order.

So... keep it simple, set
  LANG=en_US.UTF-8
and use vi or something else that comes with cygwin to create the file
and I'll have a file with UTF-8 character encoding - correct?

Thanks,
Lee




Re: UTF-8 character encoding

2018-06-21 Thread Houder
On Thu, 21 Jun 2018 12:12:39, Houder wrote:
> On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> > I'm looking at
> >   https://cygwin.com/packaging-hint-files.html#pvr.hint
> > and it starts off with
> >   Use UTF-8 character encoding.
> > 
> > How do I do that and how do I check that I actually did use UTF-8
> > character encoding _without_ using file?
> [snip]
> 
> > I used vi to create both files & I'd like to understand why file says
> > one is ascii & the other is utf-8
> 
> vim can tell you that in the statusline ...
> 
> :help statusline
> :help encoding
> 
> Ask Google to help you with the details: GS: "vim show encoding in status".
> 
> E.g.
> 
>  - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
>    (Show fileencoding and bomb in the status line)
> 
> As an example:
> 
> set laststatus=2
> "set statusline=...
> set statusline+=\ en:\ %{strlen(&fenc)\ ?\ &fenc\ :\ 'x'}
> "set statusline+...

Also read:

 - https://unix.stackexchange.com/questions/23389/how-can-i-set-vims-default-encoding-to-utf-8
   (How can I set VIM's default encoding to UTF-8?)

for a "quickstart" on the subject of character encoding/vim.

Henri





Re: UTF-8 character encoding

2018-06-21 Thread Houder
On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
> 
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
[snip]

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8

vim can tell you that in the statusline ...

:help statusline
:help encoding

Ask Google to help you with the details: GS: "vim show encoding in status".

E.g.

 - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
   (Show fileencoding and bomb in the status line)

As an example:

set laststatus=2
"set statusline=...
set statusline+=\ en:\ %{strlen(&fenc)\ ?\ &fenc\ :\ 'x'}
"set statusline+...

Henri





Re: UTF-8 character encoding

2018-06-20 Thread Andrey Repin
Greetings, Lee!

> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.

> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text

> $ file test.c
> test.c: C source, ASCII text

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8


-- 
With best regards,
Andrey Repin
Thursday, June 21, 2018 4:25:27

Sorry for my terrible english...





Re: UTF-8 character encoding

2018-06-20 Thread Stefan Weil
On 20.06.2018 at 20:09, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
> 
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
> 
> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text
> 
> $ file test.c
> test.c: C source, ASCII text
> 
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
> 
> Thanks,
> Lee

ASCII is a subset of UTF-8, so that's fine.

The file command will report ASCII as long as your text does not contain
any non-ASCII characters. If you add some (for example ÄÖÜ), it should
report UTF-8.
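
If you want to check this without the file command, a minimal Python
sketch could look like the following. The function name classify_bytes
and its three return strings are made up for this example; it relies
only on the strict "utf-8" codec in the standard library, which rejects
invalid byte sequences.

```python
def classify_bytes(data):
    """Return 'ASCII', 'UTF-8' or 'not UTF-8' for a byte string (hypothetical helper)."""
    if all(b < 0x80 for b in data):
        return "ASCII"          # every byte is in the 7-bit range
    try:
        data.decode("utf-8")    # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "not UTF-8"
    return "UTF-8"
```

For a file, read it in binary mode and pass the contents, for example
classify_bytes(open("test.c", "rb").read()). A pure-ASCII file is
reported as ASCII here too, which matches what file does, and it is
still valid UTF-8.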

Regards,
Stefan

