On 25-9-2019 10:11, Kjell Rilbe wrote:
The suggested <binary string literal> could be used to write characters
using the literal's encoding directly. E.g. for a UTF-8 literal, the
character Ö could be written as '\xC3\x96', and in a WIN1252 literal it
could be written as '\xD6'.

Since these kinds of escapes would be a breaking change to how string
literals are parsed, a solution would have to be found to determine if a
specific string literal is to be parsed with these kinds of escapes or
not. A prefix that could be combined with any character set prefix?

The SQL standard option would be for Firebird to introduce support for the Unicode character string literal U&'...' UESCAPE '<char>' (where the default Unicode escape character is \ if the UESCAPE clause is absent).

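With this syntax, your example (the character Ö, code point U+00D6) would be U&'\00D6'. Note that the escape names a Unicode code point, not the bytes of a particular encoding.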
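
To make the difference between encoded bytes and code points concrete, here is a small Python sketch (Python is only used for illustration, it has nothing to do with Firebird internals). It prints the code point of Ö next to its UTF-8 and WIN1252 byte sequences, assuming Python's cp1252 codec is a fair stand-in for WIN1252.

```python
import unicodedata

ch = "Ö"  # LATIN CAPITAL LETTER O WITH DIAERESIS

# The value a Unicode escape would carry: the code point itself.
code_point = f"U+{ord(ch):04X}"

# The values a byte-level escape would carry: encoding-dependent bytes.
utf8_hex = ch.encode("utf-8").hex().upper()
win1252_hex = ch.encode("cp1252").hex().upper()

print(code_point)   # U+00D6
print(utf8_hex)     # C396
print(win1252_hex)  # D6

# Reading UTF-8 bytes as if they were a code point gives an unrelated character.
print(unicodedata.name(chr(0xC396)))
```

The last line shows why the two notations must not be mixed up: U+C396 is a Hangul syllable, not Ö.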

From 5.3 <literal>:

<Unicode character string literal> ::=
[ <introducer> <character set specification> ]
U <ampersand> <quote> [ <Unicode representation>... ] <quote>
[ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
<Unicode escape specifier>

<Unicode representation> ::=
<character representation>
| <Unicode escape value>

Relevant syntax rules:

2) In a <Unicode character string literal>, the sequence:
<quote> <Unicode representation>... <quote> <separator> <quote> <Unicode representation>... <quote>
is equivalent to the sequence:
<quote> <Unicode representation>... <Unicode representation>... <quote>
3) In a <Unicode character string literal>, the introductory 'U' may be represented either in upper case (as 'U') or in lower case (as 'u').
[..]
7) In a <character string literal>, <national character string literal>, <Unicode character string literal>, or <binary string literal>, a <separator> shall contain a <newline>.
[..]
10) In a <Unicode character string literal> that specifies “<introducer><character set specification>”, there shall be no <separator> between the <introducer> and the <character set specification>.
11) In a <Unicode character string literal>, there shall be no <separator> between the “U” and the <ampersand> nor between the <ampersand> and the <quote>.
12) The character set of a <Unicode character string literal> that specifies “<introducer><character set specification>” is the character set specified by the <character set specification>. The character set of a <Unicode character string literal> that does not specify “<introducer><character set specification>” is the character set of the SQL-client module that contains the <Unicode character string literal>.
13) A <Unicode character string literal> is equivalent to a <character string literal> in which every <Unicode escape value> has been replaced with the equivalent Unicode character. The set of characters contained in the <Unicode character string literal> shall be wholly contained in the character set of the <Unicode character string literal>.
NOTE 139 — The requirement for “wholly contained” applies after the replacement of <Unicode escape value>s with equivalent Unicode characters.

(rules 14-19 are also relevant, but apply to all string literals, left out for brevity)
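
As an illustration of the "wholly contained" requirement in rule 13, the check could look roughly like the Python sketch below. The function name wholly_contained is made up for this example, and Python codec names (cp1252, utf-8) are used as an assumed stand-in for Firebird character sets such as WIN1252 and UTF8. The input is the literal text after the escape values have already been replaced, as NOTE 139 requires.

```python
def wholly_contained(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given character set.

    text is assumed to be the literal AFTER Unicode escape values were replaced.
    codec is a Python codec name standing in for the literal's character set.
    """
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Ö exists in WIN1252, so _WIN1252 U&'\00D6' would be acceptable.
print(wholly_contained("\u00D6", "cp1252"))  # True

# U+4E2D (a CJK ideograph) does not, so _WIN1252 U&'\4E2D' would be an error.
print(wholly_contained("\u4E2D", "cp1252"))  # False
```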

From 5.2 <token> and <separator>:

<Unicode escape specifier> ::=
[ UESCAPE <quote> <Unicode escape character> <quote> ]

<Unicode escape value> ::=
<Unicode 4 digit escape value>
| <Unicode 6 digit escape value>
| <Unicode character escape value>

<Unicode 4 digit escape value> ::=
<Unicode escape character> <hexit> <hexit> <hexit> <hexit>

<Unicode 6 digit escape value> ::=
<Unicode escape character> <plus sign>
<hexit> <hexit> <hexit> <hexit> <hexit> <hexit>

<Unicode character escape value> ::=
<Unicode escape character> <Unicode escape character>

<Unicode escape character> ::=
!! See the Syntax Rules

Relevant syntax rules:

16) <Unicode escape character> shall be a single character from the source language character set other than a <hexit>, <plus sign>, <quote>, <double quote>, or <white space>.
17) If the source language character set contains <reverse solidus>, then let DEC be <reverse solidus>; otherwise, let DEC be an implementation-defined character from the source language character set that is not a <hexit>, <plus sign>, <quote>, <double quote>, or <white space>.
18) If a <Unicode escape specifier> does not contain <Unicode escape character>, then “UESCAPE <quote>DEC<quote>” is implicit.
19) In a <Unicode escape value> there shall be no <separator> between the <Unicode escape character> and the first <hexit>, nor between any of the <hexit>s.
[..]
21) <Unicode 4 digit escape value> '<Unicode escape character>xyzw' is equivalent to the character at the Unicode code point specified by U+xyzw.
22) <Unicode 6 digit escape value> '<Unicode escape character>+xyzwrs' is equivalent to the character at the Unicode code point specified by U+xyzwrs.
NOTE 132 — The 6-hexit notation is derived by taking the UCS-4 notation defined in [ISO10646] and removing the leading two hexits, whose values are always 0 (zero).
23) <Unicode character escape value> is equivalent to a single instance of <Unicode escape character>.
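
To show how rules 16 and 21-23 fit together, here is a minimal decoding sketch in Python. It is not how Firebird does or would implement this. The function names are invented for the example, and the input is assumed to be the text between the quotes with doubled quotes already resolved. Surrogate code points and separators between literal segments are deliberately not handled.

```python
import re

HEXITS = set("0123456789abcdefABCDEF")


def check_escape_char(esc: str) -> None:
    """Rule 16: one character, not a hexit, plus sign, quote, double quote or white space."""
    if len(esc) != 1 or esc in HEXITS or esc in "+'\"" or esc.isspace():
        raise ValueError(f"invalid Unicode escape character: {esc!r}")


def decode_unicode_literal_body(body: str, esc: str = "\\") -> str:
    """Replace Unicode escape values in body with the characters they denote."""
    check_escape_char(esc)
    e = re.escape(esc)
    pattern = re.compile(
        e + r"(?:(?P<four>[0-9A-Fa-f]{4})"    # rule 21: <esc>xyzw
        + r"|\+(?P<six>[0-9A-Fa-f]{6})"       # rule 22: <esc>+xyzwrs
        + r"|(?P<doubled>" + e + r"))"        # rule 23: <esc><esc>
    )
    out = []
    pos = 0
    while pos < len(body):
        if body[pos] != esc:
            out.append(body[pos])
            pos += 1
            continue
        m = pattern.match(body, pos)
        if m is None:
            raise ValueError(f"malformed Unicode escape value at offset {pos}")
        if m.group("doubled") is not None:
            out.append(esc)
        else:
            # chr() raises ValueError for values above U+10FFFF.
            out.append(chr(int(m.group("four") or m.group("six"), 16)))
        pos = m.end()
    return "".join(out)


print(decode_unicode_literal_body(r"\00D6"))            # Ö
print(decode_unicode_literal_body("!00D6", esc="!"))    # Ö, as with UESCAPE '!'
```

The second call corresponds to U&'!00D6' UESCAPE '!'. An escape character such as 'A' or '+' is rejected by check_escape_char before any decoding happens.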

--
Mark Rotteveel


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
