Re: DMD: invalid UTF character `\U0000d800`

2020-11-12 Thread Per Nordlöw via Digitalmars-d-learn

On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:

There's also:

dchar(0xd800)


Thanks


Re: DMD: invalid UTF character `\U0000d800`

2020-11-09 Thread Boris Carvajal via Digitalmars-d-learn

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:

Can I just do, for instance,

cast(dchar)0xd800

for

`\Ud800`

to accomplish this?


There's also:

dchar(0xd800)


Re: DMD: invalid UTF character `\U0000d800`

2020-11-08 Thread Jacob Carlborg via Digitalmars-d-learn

On 2020-11-08 13:39, Kagamin wrote:

Surrogate pairs are used in the rules because Java strings are UTF-16 
encoded; they don't make much sense for other encodings.


D supports the UTF-16 encoding as well. The compiler doesn't accept the 
surrogate pairs even for UTF-16 strings.


--
/Jacob Carlborg
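
As a rough illustration of the UTF-16 point above: D will not accept a lone surrogate escape, but it produces surrogate pairs itself when a code point above U+FFFF is stored in a `wstring`. The sketch below assumes Phobos' `std.utf.encode`; the helper name `utf16Length` is made up for this example.

```d
import std.utf : encode;

/// Hypothetical helper: number of UTF-16 code units needed for `c`.
size_t utf16Length(dchar c) @safe pure
{
    wchar[2] buf;
    return encode(buf, c); // throws UTFException for invalid code points
}

unittest
{
    // A BMP character fits in one code unit.
    assert(utf16Length('a') == 1);

    // U+1F600 lies outside the BMP, so it needs a surrogate pair.
    assert(utf16Length('\U0001F600') == 2);

    // The compiler generates the pair for a UTF-16 string literal.
    wstring s = "\U0001F600"w;
    assert(s.length == 2);
    assert(s[0] == 0xD83D); // high surrogate
    assert(s[1] == 0xDE00); // low surrogate
}
```

So the surrogates only ever appear as a pair produced by the encoder, never as a code point written directly in source.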


Re: DMD: invalid UTF character `\U0000d800`

2020-11-08 Thread Steven Schveighoffer via Digitalmars-d-learn

On 11/8/20 5:47 AM, Per Nordlöw wrote:

On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF


Thanks!

I'm only using these UTF characters to create ranges that source code 
characters are checked against during parsing. Therefore I would like to 
just convert these to a `dchar` for now using a `cast`. Can I just do, 
for instance,


     cast(dchar)0xd800

for

     `\Ud800`

to accomplish this?


Yes, use the cast. It should work.

It's just the D grammar that is stopping you. A dchar is just an integer 
under the hood, so the cast should be fine.


-Steve
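
A minimal sketch of the cast approach, assuming the goal is only to describe the surrogate ranges for a range check during parsing. The constant and function names are hypothetical and not taken from the original parser. Note that the surrogate range starts at 0xD800; 0xD8000 would be the unrelated code point U+D8000.

```d
// Surrogate range bounds built with a cast instead of an escape sequence,
// since DMD rejects escapes for U+D800..U+DFFF as invalid UTF characters.
enum dchar highSurrogateFirst = cast(dchar) 0xD800;
enum dchar highSurrogateLast  = cast(dchar) 0xDBFF;
enum dchar lowSurrogateFirst  = cast(dchar) 0xDC00;
enum dchar lowSurrogateLast   = cast(dchar) 0xDFFF;

/// Hypothetical helper: true if `c` lies in either surrogate range.
/// The two ranges are adjacent, so one interval check covers both.
bool isSurrogate(dchar c) pure nothrow @nogc @safe
{
    return c >= highSurrogateFirst && c <= lowSurrogateLast;
}

static assert(isSurrogate(cast(dchar) 0xD800));
static assert(isSurrogate(cast(dchar) 0xDFFF));
static assert(!isSurrogate(cast(dchar) 0xD8000)); // U+D8000 is not a surrogate
static assert(!isSurrogate('\ua0a0'));
```

The cast only sidesteps the check on escape sequences. The resulting `dchar` still holds a value that is not a valid Unicode scalar value, so it is suitable for comparisons but not for encoding into a string.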


Re: DMD: invalid UTF character `\U0000d800`

2020-11-08 Thread Kagamin via Digitalmars-d-learn

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:

dchar


Surrogate pairs are used in the rules because Java strings are UTF-16 
encoded; they don't make much sense for other encodings.


Re: DMD: invalid UTF character `\U0000d800`

2020-11-08 Thread Per Nordlöw via Digitalmars-d-learn

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:

cast(dchar)0xd800


To clarify,

enum dch1 = cast(dchar)0xa0a0;
enum dch2 = '\ua0a0';
assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these 
encoding questions for now?


Re: DMD: invalid UTF character `\U0000d800`

2020-11-08 Thread Per Nordlöw via Digitalmars-d-learn
On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg 
wrote:

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF


Thanks!

I'm only using these UTF characters to create ranges that source 
code characters are checked against during parsing. Therefore I 
would like to just convert these to a `dchar` for now using a 
`cast`. Can I just do, for instance,


cast(dchar)0xd800

for

`\Ud800`

to accomplish this?


Re: DMD: invalid UTF character `\U0000d800`

2020-11-07 Thread Jacob Carlborg via Digitalmars-d-learn

On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:

 CtoLexer_parser.d   665  57 error   invalid UTF character \Ud800
 CtoLexer_parser.d   665  67 error   invalid UTF character \Udbff
 CtoLexer_parser.d   666  28 error   invalid UTF character \Ud800
 CtoLexer_parser.d   666  38 error   invalid UTF character \Udbff
 CtoLexer_parser.d   666  53 error   invalid UTF character \Udc00
 CtoLexer_parser.d   666  63 error   invalid UTF character \Udfff


Doesn't DMD support these Unicode code points yet?


They're not valid:

"The Unicode standard permanently reserves these code point 
values for UTF-16 encoding of the high and low surrogates, and 
they will never be assigned a character, so there should be no 
reason to encode them. The official Unicode standard says that no 
UTF forms, including UTF-16, can encode these code points" [1].


"... the standard states that such arrangements should be treated 
as encoding errors" [1].


Perhaps they need to be combined with other code points to form a 
valid character.


[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

--
/Jacob Carlborg
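
To make the quoted rule concrete, here is a small sketch that classifies code points the same way. It assumes Phobos' `std.utf.isValidDchar`, which reports surrogates and values above U+10FFFF as invalid; the `classify` helper and its labels are hypothetical.

```d
import std.utf : isValidDchar;

/// Hypothetical helper: labels a code point according to the rule quoted above.
string classify(dchar c) @safe pure nothrow @nogc
{
    // U+D800..U+DFFF are reserved for UTF-16 surrogates.
    if (c >= 0xD800 && c <= 0xDFFF)
        return "surrogate";
    // Everything else up to U+10FFFF is a Unicode scalar value.
    return isValidDchar(c) ? "scalar value" : "out of range";
}

unittest
{
    assert(classify(cast(dchar) 0xD800) == "surrogate");
    assert(classify(cast(dchar) 0xDFFF) == "surrogate");
    assert(classify('\ua0a0') == "scalar value");
    assert(classify(cast(dchar) 0x110000) == "out of range");

    // Phobos agrees that surrogates are not valid code points to encode.
    assert(!isValidDchar(cast(dchar) 0xDBFF));
}
```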