On 9/17/19 6:40 PM, Tom Lane wrote: > "Jonathan S. Katz" <[email protected]> writes: >> v2 attached. Thanks! > > I whacked this around some (well, quite a bit actually);
So I see :) Thanks. > notably, > I thought we'd better describe things that are in our engine but > not XQuery, as well as vice-versa. Yeah, that makes sense. Overall it reads really well. One question I had in my head (and probably should have asked) was answered around the \w character class wrt collation. > After a re-read of the XQuery spec, it seems to me that the character > entry form that they have and we don't is actually "&#NNNN;" like > HTML, rather than just "#NN". Can anyone double-check that? Clicking through the XQuery spec eventual got me to here[1] (which warns me that its out of date, but that is what its "current" specs linked me to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;" to specify characters (which I recognize as a character escape from HTML, XML et al.). So based on that, my answer is "yes." > Does > it work outside bracket expressions, or only inside? Looking at the parse tree (start with the "atom"[2]), I read it as being able to use that syntax both inside and outside the bracket expressions. Here is a v4. I added some more paragraphs the bullet point that explains the different flags to make it feel a bit less dense. Thanks, Jonathan [1] https://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref [2] https://www.w3.org/TR/xmlschema-2/#nt-atom
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2b4fe0cb59..f8ced55daa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5970,6 +5970,138 @@ SELECT regexp_match('abc01234xyz',
'(?:(.*?)(\d+)(.*)){1,1}');
<!-- end re_syntax.n man page -->
+ <sect3 id="posix-vs-xquery">
+ <title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
+
+ <para>
+ Since SQL:2008, the SQL standard includes
+ a <literal>LIKE_REGEX</literal> operator that performs pattern
+ matching according to the XQuery regular expression
+ standard. <productname>PostgreSQL</productname> does not yet
+ implement this operator, but you can get very similar behavior using
+ the <function>regexp_match()</function> function.
+ </para>
+
+ <para>
+ Notable differences between the existing POSIX-based
+ regular-expression feature and XQuery regular expressions include:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ XQuery character class subtraction is not supported. An example of
+ this feature is using the following to match only English
+ consonants: <literal>[a-z-[aeiou]]</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ XQuery allows a literal character in the pattern to be written as
+ an HTML-style Unicode character reference, for
+ instance <literal>&#<replaceable>NNNN</replaceable>;</literal>.
+ This is not supported by POSIX, but you can get the same effect by
+ writing <literal>\u<replaceable>NNNN</replaceable></literal>. (The
+ equivalence is only exact when the database encoding is UTF-8.)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The SQL standard (not XQuery itself) attempts to cater for more
+ variants of <quote>newline</quote> than POSIX does. The
+ newline-sensitive matching options described above consider only
+ ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
+ us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
+ (a Windows-style newline), and some Unicode-only characters like
+ LINE SEPARATOR (U+2028) as newlines as well.
+ Notably, <literal>.</literal> and <literal>\s</literal> should
+ count <literal>\r\n</literal> as one character not two according to
+ SQL.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ XQuery character class shorthands <literal>\c</literal>,
+ <literal>\C</literal>, <literal>\i</literal>,
+ and <literal>\I</literal> are not supported.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ XQuery character class elements
+ using <literal>\p{UnicodeProperty}</literal> or the
+ inverse <literal>\P{UnicodeProperty}</literal> are not supported.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ POSIX interprets character classes such as <literal>\w</literal>
+ (see <xref linkend="posix-class-shorthand-escapes-table"/>)
+ according to the prevailing locale (which you can control by
+ attaching a <literal>COLLATE</literal> clause to the operator or
+ function). XQuery specifies these classes by reference to Unicode
+ character properties, so equivalent behavior is obtained only with
+ a locale that follows the Unicode rules.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Of the character-entry escapes described in
+ <xref linkend="posix-character-entry-escapes-table"/>,
+ XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
+ and <literal>\t</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ XQuery does not support
+ the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
+ for character classes within bracket expressions.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ XQuery does not have lookahead or lookbehind constraints,
+ nor any of the constraint escapes described in
+ <xref linkend="posix-constraint-escapes-table"/>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The metasyntax forms described in <xref linkend="posix-metasyntax"/>
+ do not exist in XQuery.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The regular expression flag letters defined by XQuery are
+ related to but not the same as the option letters for POSIX
+ (<xref linkend="posix-embedded-options-table"/>). While the
+ <literal>i</literal> and <literal>q</literal> options behave the
+ same, others do not.
+ </para>
+ <para>
+ XQuery's <literal>s</literal> (allow dot to match newline)
+ and <literal>m</literal> (allow <literal>^</literal>
+ and <literal>$</literal> to match at newlines) flags provide access
+ to the same behaviors as POSIX's <literal>n</literal>,
+ <literal>p</literal> and <literal>w</literal> flags, but
+ do <emphasis>not</emphasis> match the behavior of
+ POSIX's <literal>s</literal> and <literal>m</literal> flags.
+ Note in particular that dot-matches-newline is the default behavior
+ in POSIX but not XQuery.
+ </para>
+ <para>
+ Also, XQuery's <literal>x</literal> (ignore whitespace in pattern)
+ flag is noticeably different from POSIX's expanded-mode flag.
+ POSIX's <literal>x</literal> flag also allows <literal>#</literal> to
+ begin a comment in the pattern, and POSIX will not ignore a
+ whitespace character after a backslash.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ </sect3>
</sect2>
</sect1>
@@ -11793,6 +11925,14 @@ table2-mapping
</programlisting>
</para>
</listitem>
+
+ <listitem>
+ <para>
+ There are minor differences in the interpretation of regular
+ expression patterns used in <literal>like_regex</literal> filters, as
+ described in <xref linkend="jsonpath-regular-expressions"/>.
+ </para>
+ </listitem>
</itemizedlist>
<sect3 id="strict-and-lax-modes">
@@ -11872,6 +12012,36 @@ table2-mapping
</sect3>
+ <sect3 id="jsonpath-regular-expressions">
+ <title>Regular Expressions</title>
+
+ <para>
+ SQL/JSON path expressions allow matching text to a regular expression
+ with the <literal>like_regex</literal> filter. For example, the
+ following SQL/JSON path query would case-insensitively match all
+ strings in an array that start with an English vowel:
+<programlisting>
+'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
+</programlisting>
+ </para>
+
+ <para>
+ The SQL/JSON standard borrows its definition for regular expressions
+ from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
+ XQuery standard. PostgreSQL does not currently support the
+ <literal>LIKE_REGEX</literal> operator. Therefore,
+ the <literal>like_regex</literal> filter is implemented using the
+ POSIX regular expression engine described in
+ <xref linkend="functions-posix-regexp"/>. This leads to various minor
+ discrepancies from standard SQL/JSON behavior, which are cataloged in
+ <xref linkend="posix-vs-xquery"/>.
+ Note, however, that the flag-letter incompatibilities described there
+ do not apply to SQL/JSON, as it translates the XQuery flag letters to
+ match what the POSIX engine expects.
+ </para>
+
+ </sect3>
+
<sect3 id="functions-sqljson-path-operators">
<title>SQL/JSON Path Operators and Methods</title>
@@ -12113,10 +12283,13 @@ table2-mapping
<row>
<entry><literal>like_regex</literal></entry>
<entry>
- Tests pattern matching with POSIX regular expressions
- (see <xref linkend="functions-posix-regexp"/>). Supported flags
- are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
- <literal>x</literal>, and <literal>q</literal>.</entry>
+ Tests whether the first operand matches the regular expression
+ given by the second operand (see
+ <xref linkend="jsonpath-regular-expressions"/>).
+ An optional <literal>flag</literal> string can be given.
+ Supported flags are <literal>i</literal>, <literal>m</literal>,
+ <literal>s</literal>, and <literal>q</literal>.
+ </entry>
<entry><literal>["abc", "abd", "aBdC", "abdacb",
"babc"]</literal></entry>
<entry><literal>$[*] ? (@ like_regex "^ab.*c" flag
"i")</literal></entry>
<entry><literal>"abc", "aBdC", "abdacb"</literal></entry>
signature.asc
Description: OpenPGP digital signature
