Hi Pavel
On 25.08.24 20:57, Pavel Stehule wrote:
>
> There is unwanted white space in the patch
>
> -<-><--><-->xmlFreeDoc(doc);
> +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
> XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
> + <>{
> +<-><-->xmlChar *xmlbuf = NULL;
> +<-><-->int nbytes;
> +<-><-->int
>
I missed that one. Just removed it, thanks!
> 1. the xml is serialized to UTF8 string every time, but when target
> type is varchar or text, then it should be every time encoded to
> database encoding. Is not possible to hold utf8 string in latin2
> database varchar.
I'm calling xml_parse() with GetDatabaseEncoding(), so I thought I would
be on the safe side:
    if (format == XMLSERIALIZE_CANONICAL ||
        format == XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
        doc = xml_parse(data, XMLOPTION_DOCUMENT, false,
                        GetDatabaseEncoding(), NULL, NULL, NULL);
... or do you mean something else?
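For reference, the concern in point 1 (a UTF-8 byte string cannot be
stored unchanged in a Latin-2 varchar) can be illustrated with a minimal
standalone sketch. It is not PostgreSQL code: POSIX iconv(3) is assumed
here as a stand-in for the server's own conversion routines, and
utf8_to_latin2() is a hypothetical helper name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert a NUL-terminated UTF-8 string to ISO-8859-2 (Latin-2).
 * Hypothetical helper for illustration only; iconv(3) stands in for
 * the server-side encoding conversion.
 *
 * Returns the number of bytes written to dst (excluding the NUL),
 * or -1 if the converter cannot be opened or the conversion fails.
 */
static long
utf8_to_latin2(const char *src, char *dst, size_t dstsize)
{
    iconv_t cd;
    char   *in = (char *) src;      /* iconv() takes a non-const pointer */
    char   *out = dst;
    size_t  inleft = strlen(src);
    size_t  outleft;
    size_t  rc;

    assert(dst != NULL && dstsize > 0);
    outleft = dstsize - 1;          /* reserve room for the terminator */

    cd = iconv_open("ISO-8859-2", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return -1;

    *out = '\0';
    return (long) (out - dst);
}
```

With the input "<a>\xC5\xBE</a>" (the two UTF-8 bytes C5 BE encode U+017E),
the converted string is one byte shorter and carries the single Latin-2
byte BE in that position, so the two byte strings are not interchangeable.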
> 2. The proposed feature can increase some confusion in implementation
> of NO IDENT. I am not an expert on this area, so I checked other
> databases. DB2 does not have anything similar. But Oracle's "NO IDENT"
> clause is very similar to the proposed "CANONICAL". Unfortunately,
> there is different behaviour of NO IDENT - Oracle's really removes
> formatting, Postgres does nothing.
Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
of mine.
INDENT pretty-prints the document, while NO INDENT basically "does
nothing" and prints the XML as is, e.g.
SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
                xmlserialize
--------------------------------------------
 <foo>                                     +
   <bar>                                   +
     <val z="1" a="8"><![CDATA[0&1]]></val>+
   </bar>                                  +
 </foo>                                    +
(1 row)
SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
                         xmlserialize
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)
SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
                         xmlserialize
--------------------------------------------------------------
 <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
(1 row)
... while CANONICAL converts the XML to its canonical form [1,2], e.g. by
sorting attributes and replacing CDATA sections with their character
content:
SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
                     xmlserialize
------------------------------------------------------
 <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
(1 row)
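The two effects visible above (attribute order and CDATA replacement) can
be sketched in a few lines of standalone C. This is a toy, not the patch
and not libxml2's C14N implementation: Attr, append() and
toy_canonical_element() are hypothetical names, it handles a single
element with non-namespaced attributes, and attribute values are assumed
to need no escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute pair, for illustration only. */
typedef struct
{
    const char *name;
    const char *value;
} Attr;

/* Order attributes by name (byte-wise comparison). */
static int
attr_cmp(const void *a, const void *b)
{
    return strcmp(((const Attr *) a)->name, ((const Attr *) b)->name);
}

/* Append s to the NUL-terminated buf, truncating if it does not fit. */
static void
append(char *buf, size_t bufsize, const char *s)
{
    size_t used = strlen(buf);

    if (used + 1 < bufsize)
        snprintf(buf + used, bufsize - used, "%s", s);
}

/*
 * Write a toy canonical form of <tag attrs...>text</tag> into buf:
 * attributes are sorted by name, and the text is emitted as plain
 * character data with '&', '<' and '>' escaped, which is what happens
 * to the content of a CDATA section. Sorts attrs in place.
 */
static void
toy_canonical_element(const char *tag, Attr *attrs, size_t nattrs,
                      const char *text, char *buf, size_t bufsize)
{
    size_t      i;
    const char *p;

    assert(bufsize > 0);
    buf[0] = '\0';

    qsort(attrs, nattrs, sizeof(Attr), attr_cmp);

    append(buf, bufsize, "<");
    append(buf, bufsize, tag);
    for (i = 0; i < nattrs; i++)
    {
        append(buf, bufsize, " ");
        append(buf, bufsize, attrs[i].name);
        append(buf, bufsize, "=\"");
        append(buf, bufsize, attrs[i].value);
        append(buf, bufsize, "\"");
    }
    append(buf, bufsize, ">");

    for (p = text; *p != '\0'; p++)
    {
        char one[2] = {*p, '\0'};

        if (*p == '&')
            append(buf, bufsize, "&amp;");
        else if (*p == '<')
            append(buf, bufsize, "&lt;");
        else if (*p == '>')
            append(buf, bufsize, "&gt;");
        else
            append(buf, bufsize, one);
    }

    append(buf, bufsize, "</");
    append(buf, bufsize, tag);
    append(buf, bufsize, ">");
}
```

Calling it with the tag "val", the attributes z="1" and a="8" (in that
order) and the text "0&1" yields the same <val> element as in the
CANONICAL output above.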
As far as I know, xmlserialize CANONICAL does not exist in any other
database, and it is not part of the SQL/XML standard.
Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
it is not entirely clear to me whether SQL/XML states that NO INDENT must
remove existing indentation from XML strings.
It says:
"INDENT — the choice of whether to “pretty-print” the serialized XML by
means of indentation, either
True or False.
....
i) If <XML serialize indent> is specified and does not contain NO, then
let IND be True.
ii) Otherwise, let IND be False."
When I wrote the patch I assumed it meant leaving the XML as is ... but
I might be wrong.
Perhaps it would be best if we open a new thread for this topic.
Thank you for reviewing this patch. Much appreciated!
Best,
--
Jim
[1] https://www.w3.org/TR/xml-c14n11/
[2] https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html