Hi all, (Adding in CC Tom and Eric, as committer and author.) A customer has reported a regression with the parsing of rather large XML data, introduced by the set of backpatches done with f68d6aabb7e2 & friends.
The problem is introduced by the change from xmlParseBalancedChunkMemory() to xmlNewNode() + xmlParseInNodeContext() in xml_parse(), to avoid an issue in xmlParseBalancedChunkMemory() in the range of libxml2 2.13.0-2.13.2 for a bug that has already been fixed upstream, where we use a temporary root node for the case where parse_as_document is false. If the input XML data is large enough, one gets a failure at the top of the latest branches, and it worked properly before. Here is a short test case (courtesy of a colleague, case that I've modified slightly): CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML ); DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000); BEGIN BEGIN INSERT INTO xmldata (id, message) VALUES ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) ); RAISE NOTICE 'insert 40MB successful'; EXCEPTION WHEN OTHERS THEN RAISE NOTICE 'Error insert 40MB: %', SQLERRM; END; END $$; Switching back to the previous code, where we rely on xmlParseBalancedChunkMemory() fixes the issue. A quick POC is attached. It fails one case in check-world with SERIALIZE because I am not sure it is possible to pass down some options through xmlParseBalancedChunkMemory(), still the regression is gone, and I am wondering if there is not a better solution to be able to dodge the original problem and still accept this case. One good thing is that xmlParseBalancedChunkMemory() is able to return a list of nodes, that we need for this code path of xml_parse(). So perhaps one solution would be the addition of a code path with xmlParseBalancedChunkMemory() depending on the options we want to process, keeping the code path with the fake "content-root" for the XML SERIALIZE case. The patch in question has been applied first to 6082b3d5d3d1 on HEAD impacting v18~, then it has been backpatched down to all stable branches, like f68d6aabb7e2, introducing the regression in all the stable branches since the minor releases done in August 2024, as of: 12.20, 13.16, 14.13, 15.8, 16.4. Thoughts or comments? -- Michael
From 1834a83ec4e013fcc762e07372ca3fa365e2541e Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@paquier.xyz> Date: Thu, 24 Jul 2025 12:01:52 +0900 Subject: [PATCH] Fix xml2 regression --- src/backend/utils/adt/xml.c | 27 ++++----------------------- 1 file changed, 4 insertions(+), 23 deletions(-) diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index f7b731825fca..3a44794a2393 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -1900,9 +1900,6 @@ xml_parse(text *data, XmlOptionType xmloption_arg, } else { - xmlNodePtr root; - xmlNodePtr oldroot PG_USED_FOR_ASSERTS_ONLY; - /* set up document with empty root node to be the context node */ doc = xmlNewDoc(version); if (doc == NULL || xmlerrcxt->err_occurred) @@ -1916,31 +1913,15 @@ xml_parse(text *data, XmlOptionType xmloption_arg, "could not allocate XML document"); doc->standalone = standalone; - root = xmlNewNode(NULL, (const xmlChar *) "content-root"); - if (root == NULL || xmlerrcxt->err_occurred) - xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY, - "could not allocate xml node"); - - /* - * This attaches root to doc, so we need not free it separately; - * and there can't yet be any old root to free. - */ - oldroot = xmlDocSetRootElement(doc, root); - Assert(oldroot == NULL); - /* allow empty content */ if (*(utf8string + count)) { xmlNodePtr node_list = NULL; - xmlParserErrors res; - res = xmlParseInNodeContext(root, - (char *) utf8string + count, - strlen((char *) utf8string + count), - options, - &node_list); - - if (res != XML_ERR_OK || xmlerrcxt->err_occurred) + res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, + utf8string + count, + &node_list); + if (res_code != 0 || xmlerrcxt->err_occurred) { xmlFreeNodeList(node_list); xml_errsave(escontext, xmlerrcxt, -- 2.50.0
signature.asc
Description: PGP signature