Nikita Glukhov <n.glu...@postgrespro.ru> writes: > Fixed patch attached.
I looked through this again, and I still don't feel very happy with it. As I mentioned upthread, I'm not convinced that we ought to have *both* quoting and backslash-escaping in these datatypes. TIMTOWTDI may be a virtue to the Perl crowd, but to a lot of other people it's just confusing and extra complication. In particular it will translate directly to extra complication in every application that wants to use non-alphanumeric labels, since they'll have to be prepared to deparse whatever style the backend happens to put out. It's also adding a far from trivial amount of complication to the patch itself. And it adds bug-prone ambiguity; for instance, the patch fails to document whether backslashes do anything special within quotes. On balance, since the existing limitations on labels are an approximation of SQL identifier rules, I think it'd be a good idea to use SQL-identifier quoting rules; that is, you have to use quotes for anything that isn't alphanumeric, backslashes are not special, and the way to get a double quote that's data is to write two adjacent double quotes. That's simple and it doesn't create any confusion when writing an ltree value inside a SQL literal. It's less great if you're trying to write an ltree within an array or record literal value, but we can't have everything (and the backslash alternative would be no better for that anyway). I've still got mixed emotions about the exception of allowing whitespace to be stripped before or after a label. While that probably doesn't pose any forward-compatibility hazards (ie, it's unlikely we'd want to redefine such syntax to mean something different), I'm not sure it passes the least-confusion test. We were just getting beat up a couple days ago about the fact that record_in is lax about whitespace [1][2], so maybe we shouldn't make that same choice here. As far as the code itself goes, I don't think the code structure in ltree_io.c is very well chosen. There seem to be half a dozen routines that are all intimately dependent on the quoting/escaping rules, and it's really pretty hard to be sure that they're all in sync (and if they're not, that's going to lead to crashes and perhaps security-grade bugs). Moreover it looks like you end up making multiple passes over the input, so it's inefficient as well as hard to understand/maintain. Given the choice to have a separate frontend function that recognizes a label as a single token, it seems like it'd be better if that function were also responsible for generating a de-quoted label value as it went. I also feel that not enough attention has been paid to the error reporting. For instance, the patch changes the longstanding policy of reporting an overlength label with its end position: @@ -53,7 +54,7 @@ SELECT repeat('x', 255)::ltree; SELECT repeat('x', 256)::ltree; ERROR: label string is too long -DETAIL: Label length is 256, must be at most 255, at character 257. +DETAIL: Label length is 256, must be at most 255, at character 1. SELECT ltree2text('1.2.3.34.sdf'); ltree2text -------------- That might be an OK choice in a green field, but this isn't a green field, even if we did change the wording of the message since v12. I also noted some random inconsistencies such as regression=# select '1234567890*1234567890'::ltree; ERROR: ltree syntax error at character 11 LINE 1: select '1234567890*1234567890'::ltree; ^ regression=# select '1234567890+1234567890'::ltree; ERROR: ltree syntax error at character 11 LINE 1: select '1234567890+1234567890'::ltree; ^ DETAIL: Unexpected character Apparently that's because * is special to lquery while + isn't, but that distinction really shouldn't matter to ltree. (This DETAIL message isn't close to following project style for detail messages, either.) It'd probably be better if the frontend function didn't contain assumptions about which punctuation characters are important. Another thought here, though it's not really the fault of this patch, is that I really think ltree ought to go over to a treatment of non-ASCII characters that's more like the core parser's, ie anything that isn't ASCII is data (or assumed-to-be-alphanumeric, if you prefer). The trouble with the current definition is that it's dependent on LC_CTYPE, so labels that are acceptable to one database might not be acceptable elsewhere. We could remove that hazard, and as a bonus eliminate some possibly-expensive character classification tests, if we just said all non-ASCII characters are data. regards, tom lane [1] https://www.postgresql.org/message-id/bce3f9fc-6be9-44d8-ad25-f5ff109c7...@yugabyte.com [2] https://www.postgresql.org/message-id/158593753274.7901.11178770123847486396%40wrigleys.postgresql.org