Re: [Firebird-devel] String escapes for codepoints

Mark Rotteveel Wed, 25 Sep 2019 09:07:02 -0700

On 25-9-2019 10:11, Kjell Rilbe wrote:

The suggested <binary string literal> could be used to write characters
using the literal's encoding directly. E.g. for a UTF-8 literal, the
character Ö could be written as '\xC3\x96', and in a WIN1252 literal it
could be written as '\xD6'.
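
Byte values like these are easy to get wrong, so here is a minimal Python sketch for checking them. It assumes Python's "cp1252" codec is an adequate stand-in for Firebird's WIN1252 character set; the helper name hex_bytes is made up for illustration.

```python
# Assumption: Python's "cp1252" codec stands in for Firebird's WIN1252.
def hex_bytes(char: str, codec: str) -> str:
    """Return the encoded bytes of char written as \\xNN escapes."""
    return "".join(f"\\x{b:02X}" for b in char.encode(codec))


# U+00D6 is the upper-case O with diaeresis, U+00F6 the lower-case one.
for char in ("\u00D6", "\u00F6"):
    print(f"U+{ord(char):04X}", hex_bytes(char, "utf-8"), hex_bytes(char, "cp1252"))
# prints:
# U+00D6 \xC3\x96 \xD6
# U+00F6 \xC3\xB6 \xF6
```

The same character needs different bytes in each encoding, which is why a byte-level escape only makes sense once the literal's character set is known.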


Since these kinds of escapes would be a breaking change to how string
literals are parsed, a solution would have to be found to determine if a
specific string literal is to be parsed with these kinds of escapes or
not. A prefix that could be combined with any character set prefix?

The SQL standard option would be for Firebird to introduce support for
the Unicode character string literal U&'...' UESCAPE '<char>' (where the
default Unicode escape is \ if the UESCAPE clause is absent).


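With this syntax, your example would be U&'\00D6', because the escape
names the Unicode code point (U+00D6 for Ö) rather than the bytes of a
particular encoding.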

From 5.3 <literal>:

<Unicode character string literal> ::=
[ <introducer> <character set specification> ]
U <ampersand> <quote> [ <Unicode representation>... ] <quote>
[ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
<Unicode escape specifier>

<Unicode representation> ::=
<character representation>
| <Unicode escape value>

Relevant syntax rules:

2) In a <Unicode character string literal>, the sequence:

<quote> <Unicode representation>... <quote> <separator>
<quote> <Unicode representation>... <quote>

is equivalent to the sequence:
<quote> <Unicode representation>... <Unicode representation>... <quote>

3) In a <Unicode character string literal>, the introductory 'U' may be
represented either in upper case (as 'U') or in lower case (as 'u').

[..]

7) In a <character string literal>, <national character string literal>,
<Unicode character string literal>, or <binary string literal>, a
<separator> shall contain a <newline>.

[..]

10) In a <Unicode character string literal> that specifies
“<introducer><character set specification>”, there shall be no
<separator> between the <introducer> and the <character set
specification>.

11) In a <Unicode character string literal>, there shall be no
<separator> between the “U” and the <ampersand> nor between the
<ampersand> and the <quote>.

12) The character set of a <Unicode character string literal> that
specifies “<introducer><character set specification>” is the character
set specified by the <character set specification>. The character set of
a <Unicode character string literal> that does not specify
“<introducer><character set specification>” is the character set of the
SQL-client module that contains the <Unicode character string literal>.

13) A <Unicode character string literal> is equivalent to a <character
string literal> in which every <Unicode escape value> has been replaced
with the equivalent Unicode character. The set of characters contained
in the <Unicode character string literal> shall be wholly contained in
the character set of the <Unicode character string literal>.

NOTE 139 — The requirement for “wholly contained” applies after the
replacement of <Unicode escape value>s with equivalent Unicode
characters.

(rules 14-19 are also relevant, but apply to all string literals, left
out for brevity)
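
The "wholly contained" requirement of rule 13 can be illustrated with a small Python sketch. It is only an approximation: it treats "the text can be encoded in the target codec" as equivalent to "wholly contained", and it assumes Python's "cp1252" codec stands in for WIN1252. The function name wholly_contained is hypothetical.

```python
def wholly_contained(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset.

    text is the literal's value AFTER the escape values have been
    replaced (see NOTE 139); charset is a Python codec name, used here
    as an assumed stand-in for an SQL character set.
    """
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


print(wholly_contained("\u00D6", "cp1252"))  # O with diaeresis: True
print(wholly_contained("\u0416", "cp1252"))  # Cyrillic Zhe: False
print(wholly_contained("\u0416", "utf-8"))   # True
```

Under this reading, a literal such as _WIN1252 U&'\0416' would have to be rejected, while the same escape in a UTF-8 literal would be accepted.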


From 5.2 <token> and <separator>:

<Unicode escape specifier> ::=
[ UESCAPE <quote> <Unicode escape character> <quote> ]

<Unicode escape value> ::=
<Unicode 4 digit escape value>
| <Unicode 6 digit escape value>
| <Unicode character escape value>

<Unicode 4 digit escape value> ::=
<Unicode escape character> <hexit> <hexit> <hexit> <hexit>

<Unicode 6 digit escape value> ::=
<Unicode escape character> <plus sign>
<hexit> <hexit> <hexit> <hexit> <hexit> <hexit>

<Unicode character escape value> ::=
<Unicode escape character> <Unicode escape character>

<Unicode escape character> ::=
!! See the Syntax Rules

Relevant syntax rules:

16) <Unicode escape character> shall be a single character from the
source language character set other than a <hexit>, <plus sign>,
<quote>, <double quote>, or <white space>.

17) If the source language character set contains <reverse solidus>,
then let DEC be <reverse solidus>; otherwise, let DEC be an
implementation-defined character from the source language character set
that is not a <hexit>, <plus sign>, <quote>, <double quote>, or <white
space>.

18) If a <Unicode escape specifier> does not contain <Unicode escape
character>, then “UESCAPE <quote>DEC<quote>” is implicit.

19) In a <Unicode escape value> there shall be no <separator> between
the <Unicode escape character> and the first <hexit>, nor between any of
the <hexit>s.
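
Rules 16 to 18 can be sketched in Python as follows. This is an illustration under assumptions, not a proposed implementation: both function names are hypothetical, str.isspace() is used as an approximation of <white space>, and the source character set is assumed to contain the reverse solidus, so DEC is "\".

```python
HEXITS = set("0123456789ABCDEFabcdef")


def is_valid_escape_char(ch: str) -> bool:
    """Rule 16: a single character that is not a hexit, plus sign,
    quote, double quote or white space."""
    return (
        len(ch) == 1
        and ch not in HEXITS
        and ch not in "+'\""
        and not ch.isspace()  # assumption: approximates <white space>
    )


def effective_escape_char(uescape=None) -> str:
    """Rules 17-18: without a UESCAPE clause, the reverse solidus is implied."""
    if uescape is None:
        return "\\"
    if not is_valid_escape_char(uescape):
        raise ValueError(f"invalid UESCAPE character: {uescape!r}")
    return uescape


print(effective_escape_char())      # \
print(effective_escape_char("!"))   # !
print(is_valid_escape_char("F"))    # False, F is a hexit
```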

[..]

21) <Unicode 4 digit escape value> '<Unicode escape character>xyzw' is
equivalent to the character at the Unicode code point specified by
U+xyzw.

22) <Unicode 6 digit escape value> '<Unicode escape character>+xyzwrs'
is equivalent to the character at the Unicode code point specified by
U+xyzwrs.

NOTE 132 — The 6-hexit notation is derived by taking the UCS-4 notation
defined in [ISO10646] and removing the leading two hexits, whose values
are always 0 (zero).

23) <Unicode character escape value> is equivalent to a single instance
of <Unicode escape character>.
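
To make rules 21 to 23 concrete, here is a minimal Python sketch of the replacement step. It is a simplification under stated assumptions: decode_unicode_escapes is a hypothetical name, the input is the text between the quotes with any doubled quotes already handled, and surrogate code points are not treated specially.

```python
import re


def decode_unicode_escapes(body: str, esc: str = "\\") -> str:
    """Replace every <Unicode escape value> in body with its character.

    body is the text between the quotes of a U&'...' literal; esc is the
    <Unicode escape character>, defaulting to the reverse solidus.
    """
    e = re.escape(esc)
    token = re.compile(
        rf"{e}([0-9A-Fa-f]{{4}})"       # rule 21: escape + 4 hexits
        rf"|{e}\+([0-9A-Fa-f]{{6}})"    # rule 22: escape + '+' + 6 hexits
        rf"|({e}{e})"                   # rule 23: doubled escape character
        rf"|({e})"                      # any other use of the escape is invalid
    )

    def replace(match):
        four, six, doubled, _stray = match.groups()
        if four is not None:
            return chr(int(four, 16))
        if six is not None:
            return chr(int(six, 16))  # chr() rejects values above 10FFFF
        if doubled is not None:
            return esc
        raise ValueError(f"invalid escape at offset {match.start()}")

    return token.sub(replace, body)


print(decode_unicode_escapes(r"\00D6") == "\u00D6")         # True
print(decode_unicode_escapes(r"\+01F600") == "\U0001F600")  # True
print(decode_unicode_escapes("!00D6!!", esc="!"))           # Ö!
```

Note that the hexits name a code point, not encoded bytes: \00D6 yields Ö, whereas \C3B6 yields the unrelated character at U+C3B6.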


--
Mark Rotteveel


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
