Re: [Firebird-devel] String escapes for codepoints

Kjell Rilbe Wed, 25 Sep 2019 22:51:44 -0700

On 2019-09-25 at 18:05, Mark Rotteveel wrote:
> On 25-9-2019 10:11, Kjell Rilbe wrote:
>> The suggested <binary string literal> could be used to write characters
>> using the literal's encoding directly. E.g. for a UTF-8 literal, the
>> character Ö could be written as '\xC3\x96', and in a WIN1252 literal it
>> could be written as '\xD6'.
>>
>> Since these kinds of escapes would be a breaking change to how string
>> literals are parsed, a solution would have to be found to determine if a
>> specific string literal is to be parsed with these kinds of escapes or
>> not. A prefix that could be combined with any character set prefix?
>
> The SQL standard option would be for Firebird to introduce support for 
> the Unicode character string literal U&'...' UESCAPE '<char>' (where 
> the default unicode escape is \ if the UESCAPE clause is absent).
>
> With this syntax, your example would be U&'\00D6' (the escape names the
> code point U+00D6, not its encoded bytes).
>
> From 5.3 <literal>:
>
> <Unicode character string literal> ::=
> [ <introducer> <character set specification> ]
> U <ampersand> <quote> [ <Unicode representation>... ] <quote>
> [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
> <Unicode escape specifier>
>
> <Unicode representation> ::=
> <character representation>
> | <Unicode escape value>
>
> Relevant syntax rules:
>
> 2) In a <Unicode character string literal>, the sequence:
> <quote> <Unicode representation>... <quote> <separator> <quote> 
> <Unicode representation>... <quote>
> is equivalent to the sequence:
> <quote> <Unicode representation>... <Unicode representation>... <quote>
> 3) In a <Unicode character string literal>, the introductory 'U' may 
> be represented either in upper case (as 'U') or in lower case (as 'u').
> [..]
> 7) In a <character string literal>, <national character string 
> literal>, <Unicode character string literal>, or <binary string 
> literal>, a <separator> shall contain a <newline>.
> [..]
> 10) In a <Unicode character string literal> that specifies 
> “<introducer><character set specification>”, there shall be no 
> <separator> between the <introducer> and the <character set 
> specification>.
> 11) In a <Unicode character string literal>, there shall be no 
> <separator> between the “U” and the <ampersand> nor between the 
> <ampersand> and the <quote>.
> 12) The character set of a <Unicode character string literal> that 
> specifies “<introducer><character set specification>” is the character 
> set specified by the <character set specification>. The character set 
> of a <Unicode character string literal> that does not specify 
> “<introducer><character set specification>” is the character set of 
> the SQL-client module that contains the <Unicode character string 
> literal>.
> 13) A <Unicode character string literal> is equivalent to a <character 
> string literal> in which every <Unicode escape value> has been 
> replaced with the equivalent Unicode character. The set of characters 
> contained in the <Unicode character string literal> shall be wholly
> contained in the character set of the <Unicode character string literal>.
> NOTE 139 — The requirement for “wholly contained” applies after the 
> replacement of <Unicode escape value>s with equivalent Unicode 
> characters.
> (rules 14-19 are also relevant, but apply to all string literals, left 
> out for brevity)
>
> From 5.2 <token> and <separator>:
>
> <Unicode escape specifier> ::=
> [ UESCAPE <quote> <Unicode escape character> <quote> ]
>
> <Unicode escape value> ::=
> <Unicode 4 digit escape value>
> | <Unicode 6 digit escape value>
> | <Unicode character escape value>
>
> <Unicode 4 digit escape value> ::=
> <Unicode escape character> <hexit> <hexit> <hexit> <hexit>
>
> <Unicode 6 digit escape value> ::=
> <Unicode escape character> <plus sign>
> <hexit> <hexit> <hexit> <hexit> <hexit> <hexit>
>
> <Unicode character escape value> ::=
> <Unicode escape character> <Unicode escape character>
>
> <Unicode escape character> ::=
> !! See the Syntax Rules.
>
> Relevant syntax rules:
>
> 16) <Unicode escape character> shall be a single character from the 
> source language character set other than a <hexit>, <plus sign>, 
> <quote>, <double quote>, or <white space>.
> 17) If the source language character set contains <reverse solidus>, 
> then let DEC be <reverse solidus>; otherwise, let DEC be an 
> implementation-defined character from the source language character 
> set that is not a <hexit>, <plus sign>, <quote>, <double quote>, or 
> <white space>.
> 18) If a <Unicode escape specifier> does not contain <Unicode escape 
> character>, then “UESCAPE <quote>DEC<quote>” is implicit.
> 19) In a <Unicode escape value> there shall be no <separator> between 
> the <Unicode escape character> and the first <hexit>, nor between any 
> of the <hexit>s.
> [..]
> 21) <Unicode 4 digit escape value> '<Unicode escape character>xyzw' is 
> equivalent to the character at the Unicode code point specified by 
> U+xyzw.
> 22) <Unicode 6 digit escape value> '<Unicode escape character>+xyzwrs' 
> is equivalent to the character at the Unicode code point specified by 
> U+xyzwrs.
> NOTE 132 — The 6-hexit notation is derived by taking the UCS-4 
> notation defined in [ISO10646] and removing the leading two hexits, 
> whose values are always 0 (zero).
> 23) <Unicode character escape value> is equivalent to a single 
> instance of <Unicode escape character>.


Nice! Thanks! This has got my vote.
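
For anyone who wants to experiment with the quoted rules, here is a rough Python sketch of how the escape values in syntax rules 16 and 21-23 could be decoded. The function name decode_unicode_literal and its parameters are made up for illustration; nothing here reflects an actual Firebird implementation. Quote doubling, separators between adjacent quoted parts, and the "wholly contained" character set check are deliberately left out.

```python
HEXITS = "0123456789abcdefABCDEF"


def decode_unicode_literal(body: str, escape: str = "\\") -> str:
    """Decode the text between the quotes of a U&'...' literal.

    Hypothetical helper, not a Firebird API. `escape` plays the role of
    the UESCAPE character and defaults to the reverse solidus.
    """
    # Rule 16: a single character, not a hexit, plus sign, quote,
    # double quote, or white space.
    if (len(escape) != 1 or escape in HEXITS
            or escape in "+'\"" or escape.isspace()):
        raise ValueError("invalid Unicode escape character")

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != escape:
            out.append(ch)
            i += 1
            continue
        # Rule 23: a doubled escape character stands for itself.
        if body.startswith(escape, i + 1):
            out.append(escape)
            i += 2
            continue
        # Rule 22: escape, plus sign, six hexits.
        if body.startswith("+", i + 1):
            digits, step = body[i + 2:i + 8], 8
            width = 6
        # Rule 21: escape, four hexits.
        else:
            digits, step = body[i + 1:i + 5], 5
            width = 4
        if len(digits) != width or any(d not in HEXITS for d in digits):
            raise ValueError(f"malformed escape at position {i}")
        code = int(digits, 16)
        if code > 0x10FFFF:
            raise ValueError(f"code point out of range at position {i}")
        out.append(chr(code))
        i += step
    return "".join(out)


print(decode_unicode_literal(r"\00D6"))          # Ö
print(decode_unicode_literal("!00D6", "!"))      # Ö, as with UESCAPE '!'
print("Ö".encode("utf-8"), "Ö".encode("cp1252"))
```

The last line prints the encoded bytes of Ö in UTF-8 and in Windows-1252, which is what a byte-oriented escape would have to spell out per character set. The U& form names the code point once and leaves the encoding to the literal's character set.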

Regards,
Kjell


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
