On Fri, Mar 14, 2025 at 04:14:25PM +1300, David Rowley wrote:
> I don't think I'm discarding them... As far as I'm aware, your
> remaining concern is with custom jumble functions and you showed an
> example in [1] of a hypothetical jumble function that could cause the
> same bug as this thread is discussing. My response to that was that
> the custom jumble function is broken and should be fixed, which seems
> to me to be analogous to Ivan's proposal to fix _jumbleNode() to do
> something with NULL pointers.

Okay, that's where our opinions diverge.  So, well, please let me
reformulate that with some different words and terms.  This is not a
problem of having a NULL node vs a non-NULL node in the jumbling,
because by design it is possible for one to decide whether a Node
should be included in the jumbling depending on its internal values,
particularly if it gets used across more than one type of parent node.
You are claiming that such functions are broken, but what I am saying
is that such a function can be OK depending on how one wants this code
to behave for their node structure and what they expect from it.  So
I'm making a claim about keeping this stuff more flexible, while also
addressing the problem of the same node being defined successively
more than once in a parent structure.

> I now can't tell which of the following is true: 1) I've missed one of
> your concerns, or; 2) you want to find a way to make badly coded
> custom jumble functions not suffer from this bug, or; 3) you think
> that adding jumble bytes unconditionally regardless of the field's
> value still does not fix the bug in question, or; 4) something else.
> It would be good to know which of these it is.

I hope that my opinion counts, as I've worked on this whole design
in 3db72ebcbe20, 8eba3e3f0208 and other related commits.  FWIW, I
don't really see any disadvantage in supporting what I'm claiming is
OK, giving folks more flexibility to do what they want with this
facility, if it also tackles this thread's problem with the Node
handling.

So, attached is a counter-proposal, where we simply add a counter
tracking the number of nodes seen in _jumbleNode() to add more entropy
to the mix, incrementing it as well for NULL nodes.

By the way, I was looking at the set of tests proposed upthread in
this message, which as far as I know is the latest version proposed:
https://www.postgresql.org/message-id/5ac172e0b77a4baba50671cd1a15285f@localhost.localdomain

The tests require neither a relation nor a stats reset, so let's
make them cheaper as I've proposed upthread and in the attached,
including checks with multiple queries and different constants to make
sure that these are correctly grouped in the pgss reports.  The
null_sequence_number reset each time we run through a non-NULL node
in variant B reduces the entropy a bit, by the way.  As for the
argument about adding a counter in AppendJumble(), that is the wrong
level for such a change.

Anyway, perhaps we should have your extra '\0' byte or something
equivalent added in JUMBLE_STRING() when dealing with a NULL string,
rather than ignoring it.  There's an argument for simplicity, IMO,
still it is true that ignoring a NULL value reduces the entropy of the
query ID.  So I'd be OK with that if you feel strongly about this
point, as it sounds to me that you do.  This could be done better than
with a hardcoded '\0' byte, as we could use a counter and
JUMBLE_FIELD_SINGLE() in the NULL case of JUMBLE_STRING().  It's not
proved to be a problem yet, and there's value in having a simpler
solution, as well.  Anyway, one line I'd like to draw is that
AppendJumble() remains as it is written currently.  One extra thing
that we could do is to update it so that the size of a given item is
always more than zero, enforced with an Assert(), for a stricter
policy.
--
Michael
From c78150fea9a49aa2ab06f4b03b1e5beea4457366 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Mon, 17 Mar 2025 09:38:22 +0900
Subject: [PATCH v4] Add more entropy to query jumbling

A counter tracking the number of nodes computed is added, and all nodes
use it in their computation.  This also matters for NULL nodes, which
were ignored in the computation up to now.
---
 src/include/nodes/queryjumble.h               |  7 ++
 src/backend/nodes/queryjumblefuncs.c          | 10 +++
 .../pg_stat_statements/expected/select.out    | 87 ++++++++++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 20 +++++
 4 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 50eb95665872..aa1126f50162 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -37,6 +37,13 @@ typedef struct JumbleState
 	/* Number of bytes used in jumble[] */
 	Size		jumble_len;
 
+	/*
+	 * Number of Nodes included in the computation.  This counter is
+	 * incremented each time a node is added to the jumbling computation,
+	 * and is added to the jumbling to increase its entropy.
+	 */
+	int			node_count;
+
 	/* Array of locations of constants that should be removed */
 	LocationLen *clocations;
 
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index b103a2819366..ec74b92d5c3c 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -120,6 +120,7 @@ JumbleQuery(Query *query)
 	/* Set up workspace for query jumbling */
 	jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE);
 	jstate->jumble_len = 0;
+	jstate->node_count = 0;
 	jstate->clocations_buf_size = 32;
 	jstate->clocations = (LocationLen *)
 		palloc(jstate->clocations_buf_size * sizeof(LocationLen));
@@ -244,6 +245,15 @@ _jumbleNode(JumbleState *jstate, Node *node)
 {
 	Node	   *expr = node;
 
+	/*
+	 * Increment the node count and add it to the jumble.  This is done
+	 * before checking if the Node is NULL, so that even NULL nodes are
+	 * counted, contributing data known to be unique per node without
+	 * depending on the node's own contents.
+	 */
+	jstate->node_count++;
+	JUMBLE_FIELD_SINGLE(jstate->node_count);
+
 	if (expr == NULL)
 		return;
 
diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034a6..1587d2cafb3a 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -19,6 +19,86 @@ SELECT 1 AS "int";
    1
 (1 row)
 
+-- LIMIT and OFFSET patterns
+-- These require more entropy with parsing node offsets.
+SELECT 1 AS "int" LIMIT 1;
+ int 
+-----
+   1
+(1 row)
+
+SELECT 1 AS "int" LIMIT 2;
+ int 
+-----
+   1
+(1 row)
+
+SELECT 1 AS "int" OFFSET 1;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" OFFSET 2;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" OFFSET 1 LIMIT 1;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" OFFSET 2 LIMIT 2;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" LIMIT 1 OFFSET 1;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" LIMIT 3 OFFSET 3;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" OFFSET 1 FETCH FIRST 2 ROW ONLY;
+ int 
+-----
+(0 rows)
+
+SELECT 1 AS "int" OFFSET 2 FETCH FIRST 3 ROW ONLY;
+ int 
+-----
+(0 rows)
+
+-- DISTINCT and ORDER BY patterns
+-- These require more entropy with parsing node offsets.
+SELECT DISTINCT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT DISTINCT 2 AS "int";
+ int 
+-----
+   2
+(1 row)
+
+SELECT 1 AS "int" ORDER BY 1;
+ int 
+-----
+   1
+(1 row)
+
+SELECT 2 AS "int" ORDER BY 1;
+ int 
+-----
+   2
+(1 row)
+
 /* this comment should not appear in the output */
 SELECT 'hello'
   -- but this one will appear
@@ -135,9 +215,14 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
      3 |    3 | SELECT $1 + $2 + $3 AS "add"
      1 |    1 | SELECT $1 AS "float"
      2 |    2 | SELECT $1 AS "int"
+     2 |    2 | SELECT $1 AS "int" LIMIT $2
+     2 |    0 | SELECT $1 AS "int" OFFSET $2
+     6 |    0 | SELECT $1 AS "int" OFFSET $2 LIMIT $3
+     2 |    2 | SELECT $1 AS "int" ORDER BY 1
      1 |    2 | SELECT $1 AS i UNION SELECT $2 ORDER BY i
      1 |    1 | SELECT $1 || $2
      1 |    1 | SELECT $1, $2 LIMIT $3
+     2 |    2 | SELECT DISTINCT $1 AS "int"
      0 |    0 | SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"
      1 |    1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
      1 |    2 | WITH t(f) AS (                                                              +
@@ -145,7 +230,7 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
        |      | )                                                                           +
        |      |   SELECT f FROM t ORDER BY f
      1 |    1 | select $1::jsonb ? $2
-(12 rows)
+(17 rows)
 
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t 
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e24b..4dcfa8ef74dc 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -12,6 +12,26 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 --
 SELECT 1 AS "int";
 
+-- LIMIT and OFFSET patterns
+-- These require more entropy with parsing node offsets.
+SELECT 1 AS "int" LIMIT 1;
+SELECT 1 AS "int" LIMIT 2;
+SELECT 1 AS "int" OFFSET 1;
+SELECT 1 AS "int" OFFSET 2;
+SELECT 1 AS "int" OFFSET 1 LIMIT 1;
+SELECT 1 AS "int" OFFSET 2 LIMIT 2;
+SELECT 1 AS "int" LIMIT 1 OFFSET 1;
+SELECT 1 AS "int" LIMIT 3 OFFSET 3;
+SELECT 1 AS "int" OFFSET 1 FETCH FIRST 2 ROW ONLY;
+SELECT 1 AS "int" OFFSET 2 FETCH FIRST 3 ROW ONLY;
+
+-- DISTINCT and ORDER BY patterns
+-- These require more entropy with parsing node offsets.
+SELECT DISTINCT 1 AS "int";
+SELECT DISTINCT 2 AS "int";
+SELECT 1 AS "int" ORDER BY 1;
+SELECT 2 AS "int" ORDER BY 1;
+
 /* this comment should not appear in the output */
 SELECT 'hello'
   -- but this one will appear
-- 
2.47.2
