Re: [PATCH] Add CANONICAL option to xmlserialize

Jim Jones Mon, 26 Aug 2024 02:32:24 -0700

Hi Pavel

On 25.08.24 20:57, Pavel Stehule wrote:
>
> There is unwanted white space in the patch
>
> -<-><--><-->xmlFreeDoc(doc);
> +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
> XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
> + <>{
> +<-><-->xmlChar    *xmlbuf = NULL;
> +<-><-->int         nbytes;
> +<-><-->int    
>
I missed that one. Just removed it, thanks!
> 1. the xml is serialized to UTF8 string every time, but when target
> type is varchar or text, then it should be every time encoded to
> database encoding. Is not possible to hold utf8 string in latin2
> database varchar.
I'm calling xml_parse using GetDatabaseEncoding(), so I thought I would
be on the safe side


if(format ==XMLSERIALIZE_CANONICAL ||format
==XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
doc =xml_parse(data, XMLOPTION_DOCUMENT, false,
GetDatabaseEncoding(), NULL, NULL, NULL);
... or you mean something else?

> 2. The proposed feature can increase some confusion in implementation
> of NO IDENT. I am not an expert on this area, so I checked other
> databases. DB2 does not have anything similar. But Oracle's "NO IDENT"
> clause is very similar to the proposed "CANONICAL". Unfortunately,
> there is different behaviour of NO IDENT - Oracle's really removes
> formatting, Postgres does nothing.

Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
of mine.
It basically "does nothing" and prints the xml as is, e.g.

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
                xmlserialize                
--------------------------------------------
 <foo>                                     +
   <bar>                                   +
     <val z="1" a="8"><![CDATA[0&1]]></val>+
   </bar>                                  +
 </foo>                                    +
 
(1 row)

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
                         xmlserialize                         
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
                         xmlserialize                         
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)

.. while CANONICAL converts the xml to its canonical form,[1,2] e.g.
sorting attributes and replacing CDATA strings by its value:

SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
                     xmlserialize                     
------------------------------------------------------
 <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
(1 row)

xmlserialize CANONICAL does not exist in any other database and it's not
part of the SQL/XML standard.

Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
it is not entirely clear to me if SQL/XML states that NO INDENT must
remove the indentation from xml strings.
It says:

"INDENT — the choice of whether to “pretty-print” the serialized XML by
means of indentation, either
True or False.
....
i) If <XML serialize indent> is specified and does not contain NO, then
let IND be True.
ii) Otherwise, let IND be False."

When I wrote the patch I assumed it meant to leave the xml as is .. but
I might be wrong.
Perhaps it would be best if we open a new thread for this topic.

Thank you for reviewing this patch. Much appreciated!

Best,

-- 
Jim

1 - https://www.w3.org/TR/xml-c14n11/
2 - https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html

Re: [PATCH] Add CANONICAL option to xmlserialize

Reply via email to