On Mar 17, 2024, at 20:09, Erik Wienhold <e...@ewie.name> wrote:
> 
> On 2024-03-17 20:50 +0100, David E. Wheeler wrote:
>> On Mar 17, 2024, at 15:12, Erik Wienhold <e...@ewie.name> wrote:
>>> So I think it makes sense to reword the entire backslash part of the
>>> paragraph and remove references to JSON entirely.  The attached patch
>>> does that and also formats the backslash escapes as a bulleted list for
>>> readability.
>> 
>> Ah, it’s JavaScript format, not JSON! This does clarify things quite
>> nicely, thank you. Happy to add my review once it’s in a commit fest.
> 
> Thanks.  https://commitfest.postgresql.org/48/4899/

Applies cleanly, `make -C doc/src/sgml check` runs without error. Doc 
improvement welcome and much clearer than before.

> I had the same reasoning while writing my first reply but scrapped that
> part because I found it obvious:  That jsonpath is parsed before calling
> jsonb_path_exists() and therefore the parser has no context about any
> variables, which might not even be hardcoded but may result from a
> query.

Right, there’s a chicken/egg problem.

> Unfortunately, I don't have access to that part of the SQL spec.  So I
> don't know how the jsonpath grammar is specified.

Seems quite logical; I think it should be documented, but I’d also be 
interested to know what the 2016 and 2023 standards say, exactly.

> Also checked git log src/backend/utils/adt/jsonpath_scan.l for some
> insights but haven't found any yet.

Everybody’s taking shortcuts relative to the standard, AFAICT. For example, 
jsonpath_scan.l matches unqouted identifiers with these two regular expressions:

<xnq>{other}+
<xnq>\/\*
<xnq,xq,xvq>\\.

Plus the backslash escapes. {other} is defined as:

/* "other" means anything that's not special, blank, or '\' or '"' */
other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f]

Which is waaaay more liberal than the ECMA standard[1], by my reading, but the 
MSDN[2] description is quite succinct (thanks for the links!):

> In JavaScript, identifiers are commonly made of alphanumeric characters, 
> underscores (_), and dollar signs ($). Identifiers are not allowed to start 
> with numbers. However, JavaScript identifiers are not only limited to ASCII — 
> many Unicode code points are allowed as well. Namely, any character in the 
> ID_Start category can start an identifier, while any character in the 
> ID_Continue category can appear after the first character.


ID_Start[3] and ID_Continue[4] point to the unicode standard codes lister, 
nether of which reference Emoji. Sure enough, in Safari:

> x = {"🎉": true}
< {🎉: true}
> x.🎉
< SyntaxError: Invalid character '\ud83c’

But in Postgres jsonpath:

david=# select '$.🎉'::jsonpath;
 jsonpath 
----------
 $."🎉"

If the MSDN references to ID_Start and ID_Continue are correct, then the 
Postgres path parser is being overly-liberal. Maybe that’s totally fine? Not 
sure what should be documented and what’s not worth it.

Aside: I’m only digging into these details because I’m busy porting the path 
parser, so trying to figure out where to be compatible and where not to. So far 
I’m rejecting '$' (but allowing '\$' and '\u0024') but taking advantage of the 
unicode support in Go to specifically validate against ID_Start and ID_Continue.

Best,

David

[1] https://262.ecma-international.org/#sec-identifier-names
[2] 
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers
[3] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Start%7D
[4] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Continue%7D









Reply via email to