On 25-9-2019 10:11, Kjell Rilbe wrote:
The suggested <binary string literal> could be used to write characters
using the literal's encoding directly. E.g. for a UTF-8 literal, the
character Ö could be written as '\xC3\xB6', and in a WIN1252 literal it
could be written as '\xD6'.
Since these kinds of escapes would be a breaking change to how string
literals are parsed, a solution would have to be found to determine whether
a specific string literal is to be parsed with these kinds of escapes or
not. Perhaps a prefix that could be combined with any character set prefix?
The SQL standard option would be for Firebird to introduce support for
the Unicode character string literal U&'...' UESCAPE '<char>' (where the
default Unicode escape character is \ if the UESCAPE clause is absent).
With this syntax, your example would be U&'\00D6': the escape gives the
Unicode code point of Ö (U+00D6), not its bytes in a particular encoding.
From 5.3 <literal>:
<Unicode character string literal> ::=
    [ <introducer> <character set specification> ]
    U <ampersand> <quote> [ <Unicode representation>... ] <quote>
    [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
    <Unicode escape specifier>
<Unicode representation> ::=
    <character representation>
  | <Unicode escape value>
Relevant syntax rules:
2) In a <Unicode character string literal>, the sequence:
    <quote> <Unicode representation>... <quote> <separator> <quote>
    <Unicode representation>... <quote>
is equivalent to the sequence:
    <quote> <Unicode representation>... <Unicode representation>... <quote>
3) In a <Unicode character string literal>, the introductory 'U' may be
represented either in upper case (as 'U') or in lower case (as 'u').
[..]
7) In a <character string literal>, <national character string literal>,
<Unicode character string literal>, or <binary string literal>, a
<separator> shall contain a <newline>.
[..]
10) In a <Unicode character string literal> that specifies
“<introducer><character set specification>”, there shall be no
<separator> between the <introducer> and the <character set specification>.
11) In a <Unicode character string literal>, there shall be no
<separator> between the “U” and the <ampersand> nor between the
<ampersand> and the <quote>.
12) The character set of a <Unicode character string literal> that
specifies “<introducer><character set specification>” is the character
set specified by the <character set specification>. The character set of
a <Unicode character string literal> that does not specify
“<introducer><character set specification>” is the character set of the
SQL-client module that contains the <Unicode character string literal>.
13) A <Unicode character string literal> is equivalent to a <character
string literal> in which every <Unicode escape value> has been replaced
with the equivalent Unicode character. The set of characters contained
in the <Unicode character string literal> shall be wholly contained in
the character set of the <Unicode character string literal>.
NOTE 139 — The requirement for “wholly contained” applies after the
replacement of <Unicode escape value>s with equivalent Unicode characters.
(Rules 14-19 are also relevant, but they apply to all string literals and
are left out for brevity.)
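Rule 13 and NOTE 139 above can be illustrated with a small Python sketch: the "wholly contained" check is applied to the characters that remain after the escapes have been replaced. The helper name wholly_contained is hypothetical, and Python codecs stand in for SQL character sets here ("cp1252" is assumed to correspond to WIN1252).

```python
def wholly_contained(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be encoded in `codec`.

    `text` is the literal's content after Unicode escape replacement.
    """
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# U+00D6 (Ö) exists in Windows-1252, so such a literal would be acceptable.
print(wholly_contained("\u00d6", "cp1252"))  # True
# U+4E2D (a CJK ideograph) does not exist in Windows-1252.
print(wholly_contained("\u4e2d", "cp1252"))  # False
# The same character is fine in a UTF-8 literal.
print(wholly_contained("\u4e2d", "utf-8"))   # True
```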
From 5.2 <token> and <separator>:
<Unicode escape specifier> ::=
    [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode escape value> ::=
    <Unicode 4 digit escape value>
  | <Unicode 6 digit escape value>
  | <Unicode character escape value>
<Unicode 4 digit escape value> ::=
    <Unicode escape character> <hexit> <hexit> <hexit> <hexit>
<Unicode 6 digit escape value> ::=
    <Unicode escape character> <plus sign>
    <hexit> <hexit> <hexit> <hexit> <hexit> <hexit>
<Unicode character escape value> ::=
    <Unicode escape character> <Unicode escape character>
<Unicode escape character> ::=
    !! See the Syntax Rules
Relevant syntax rules:
16) <Unicode escape character> shall be a single character from the
source language character set other than a <hexit>, <plus sign>,
<quote>, <double quote>, or <white space>.
17) If the source language character set contains <reverse solidus>,
then let DEC be <reverse solidus>; otherwise, let DEC be an
implementation-defined character from the source language character set
that is not a <hexit>, <plus sign>, <quote>, <double quote>, or <white
space>.
18) If a <Unicode escape specifier> does not contain <Unicode escape
character>, then “UESCAPE <quote>DEC<quote>” is implicit.
19) In a <Unicode escape value> there shall be no <separator> between
the <Unicode escape character> and the first <hexit>, nor between any of
the <hexit>s.
[..]
21) <Unicode 4 digit escape value> '<Unicode escape character>xyzw' is
equivalent to the character at the Unicode code point specified by U+xyzw.
22) <Unicode 6 digit escape value> '<Unicode escape character>+xyzwrs'
is equivalent to the character at the Unicode code point specified by
U+xyzwrs.
NOTE 132 — The 6-hexit notation is derived by taking the UCS-4 notation
defined in [ISO10646] and removing the leading two hexits, whose values
are always 0 (zero).
23) <Unicode character escape value> is equivalent to a single instance
of <Unicode escape character>.
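To make rules 16 and 21-23 concrete, here is a minimal Python sketch of how the body of a U&'...' literal could be decoded. It is an illustration under stated assumptions, not how Firebird does or would implement it: the function name decode_unicode_literal and the helper _all_hexits are hypothetical, the input is the text between the quotes, and handling of doubled quotes, separators and surrogate code points is left out.

```python
HEXITS = "0123456789abcdefABCDEF"


def _all_hexits(text: str, count: int) -> bool:
    """True if `text` consists of exactly `count` hexits."""
    return len(text) == count and all(c in HEXITS for c in text)


def decode_unicode_literal(body: str, escape: str = "\\") -> str:
    """Replace every <Unicode escape value> in `body` with its character."""
    # Rule 16: a single character, not a hexit, plus sign, quote,
    # double quote, or white space.
    if (len(escape) != 1 or escape in HEXITS
            or escape in "+'\"" or escape.isspace()):
        raise ValueError(f"invalid Unicode escape character: {escape!r}")

    out = []
    i = 0
    while i < len(body):
        if body[i] != escape:
            out.append(body[i])
            i += 1
            continue
        rest = body[i + 1:]
        if rest.startswith(escape):
            # Rule 23: a doubled escape character stands for itself.
            out.append(escape)
            i += 2
        elif rest.startswith("+") and _all_hexits(rest[1:7], 6):
            # Rule 22: escape, plus sign, six hexits -> U+xyzwrs.
            out.append(chr(int(rest[1:7], 16)))
            i += 8
        elif _all_hexits(rest[:4], 4):
            # Rule 21: escape, four hexits -> U+xyzw.
            out.append(chr(int(rest[:4], 16)))
            i += 5
        else:
            raise ValueError(f"malformed Unicode escape at offset {i}")
    return "".join(out)


print(decode_unicode_literal(r"\00D6"))               # Ö
print(decode_unicode_literal("!00D6", escape="!"))    # Ö
```

The second call corresponds to a literal written with an explicit UESCAPE '!' clause; the first relies on the default escape character from rules 17 and 18.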
--
Mark Rotteveel
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel