Hi all,
(Adding in CC Tom and Eric, as committer and  author.)

A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.

The problem is introduced by the change from
xmlParseBalancedChunkMemory() to xmlNewNode() +
xmlParseInNodeContext() in xml_parse(), to avoid an issue in
xmlParseBalancedChunkMemory() in the range of libxml2 2.13.0-2.13.2
for a bug that has already been fixed upstream, where we use a
temporary root node for the case where parse_as_document is false.

If the input XML data is large enough, one gets a failure at the top
of the latest branches, and it worked properly before.  Here is a
short test case (courtesy of a colleague, case that I've modified
slightly):
CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
BEGIN
 BEGIN
   INSERT INTO xmldata (id, message) VALUES
     ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || 
'</Content></Item></Root>')::xml) );
   RAISE NOTICE 'insert 40MB successful';
   EXCEPTION WHEN OTHERS THEN RAISE NOTICE 'Error insert 40MB: %', SQLERRM;
 END;
END $$;

Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory() fixes the issue.  A quick POC is
attached.  It fails one case in check-world with SERIALIZE because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(), still the regression is gone, and I am
wondering if there is not a better solution to be able to dodge the
original problem and still accept this case.  One good thing is that
xmlParseBalancedChunkMemory() is able to return a list of nodes, that
we need for this code path of xml_parse().  So perhaps one solution
would be the addition of a code path with
xmlParseBalancedChunkMemory() depending on the options we want to
process, keeping the code path with the fake "content-root" for the
XML SERIALIZE case.

The patch in question has been applied first to 6082b3d5d3d1 on HEAD
impacting v18~, then it has been backpatched down to all stable
branches, like f68d6aabb7e2, introducing the regression in all the
stable branches since the minor releases done in August 2024, as of:
12.20, 13.16, 14.13, 15.8, 16.4.

Thoughts or comments?
--
Michael
From 1834a83ec4e013fcc762e07372ca3fa365e2541e Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 24 Jul 2025 12:01:52 +0900
Subject: [PATCH] Fix xml2 regression

---
 src/backend/utils/adt/xml.c | 27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index f7b731825fca..3a44794a2393 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -1900,9 +1900,6 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
 		}
 		else
 		{
-			xmlNodePtr	root;
-			xmlNodePtr	oldroot PG_USED_FOR_ASSERTS_ONLY;
-
 			/* set up document with empty root node to be the context node */
 			doc = xmlNewDoc(version);
 			if (doc == NULL || xmlerrcxt->err_occurred)
@@ -1916,31 +1913,15 @@ xml_parse(text *data, XmlOptionType xmloption_arg,
 							"could not allocate XML document");
 			doc->standalone = standalone;
 
-			root = xmlNewNode(NULL, (const xmlChar *) "content-root");
-			if (root == NULL || xmlerrcxt->err_occurred)
-				xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
-							"could not allocate xml node");
-
-			/*
-			 * This attaches root to doc, so we need not free it separately;
-			 * and there can't yet be any old root to free.
-			 */
-			oldroot = xmlDocSetRootElement(doc, root);
-			Assert(oldroot == NULL);
-
 			/* allow empty content */
 			if (*(utf8string + count))
 			{
 				xmlNodePtr	node_list = NULL;
-				xmlParserErrors res;
 
-				res = xmlParseInNodeContext(root,
-											(char *) utf8string + count,
-											strlen((char *) utf8string + count),
-											options,
-											&node_list);
-
-				if (res != XML_ERR_OK || xmlerrcxt->err_occurred)
+				res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
+													   utf8string + count,
+													   &node_list);
+				if (res_code != 0 || xmlerrcxt->err_occurred)
 				{
 					xmlFreeNodeList(node_list);
 					xml_errsave(escontext, xmlerrcxt,
-- 
2.50.0

Attachment: signature.asc
Description: PGP signature

Reply via email to