Hi all,

I'm investigating the issue I reported here:
https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org

As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML.  Currently, XML that includes a DOCTYPE
declaration is rejected under this setting even though it is valid XML.
I'd like to work on a patch to address this issue and make it work as
advertised.

I traced the source of the error to line ~1500 in
/src/backend/utils/adt/xml.c:

res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                       utf8string + count, NULL);

It looks like libxml's xmlParseBalancedChunkMemory is what fails when
there's a DOCTYPE in the XML data.  My assumption is that the DOCTYPE
declaration makes the XML not well-balanced.  According to

http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory

this function returns:

> 0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise
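
To check that assumption against sample inputs, it helps to know up front
whether a given string carries a DOCTYPE in its prolog.  Below is a minimal
sketch in plain C, independent of libxml and PostgreSQL.  The name
has_leading_doctype is hypothetical, and the scan is deliberately simple: it
skips whitespace, the XML declaration, processing instructions, and comments,
then looks for "<!DOCTYPE".  It is not a full XML prolog parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * has_leading_doctype (hypothetical helper, not part of PostgreSQL or
 * libxml): returns true if the first markup after any whitespace, XML
 * declaration, processing instructions, and comments is a DOCTYPE
 * declaration.
 */
static bool
has_leading_doctype(const char *s)
{
    assert(s != NULL);

    for (;;)
    {
        const char *end;

        /* skip whitespace between prolog items */
        while (*s && isspace((unsigned char) *s))
            s++;

        if (strncmp(s, "<?", 2) == 0)
        {
            /* XML declaration or processing instruction: skip past "?>" */
            end = strstr(s + 2, "?>");
            if (end == NULL)
                return false;   /* unterminated, give up */
            s = end + 2;
        }
        else if (strncmp(s, "<!--", 4) == 0)
        {
            /* comment: skip past "-->" */
            end = strstr(s + 4, "-->");
            if (end == NULL)
                return false;   /* unterminated, give up */
            s = end + 3;
        }
        else
        {
            /* first real markup or text: is it a DOCTYPE? */
            return strncmp(s, "<!DOCTYPE", 9) == 0;
        }
    }
}
```

With a helper like this, test strings can be split into "has DOCTYPE" and
"no DOCTYPE" groups before being handed to the chunk parser, which makes it
easier to see whether the DOCTYPE alone triggers the failure.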


I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed. That function returns:

> 0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise In case recover is set to 1, the nodelist will not be
> empty even if the parsed chunk is not well balanced, assuming the parsing
> succeeded to some extent.


I haven't yet tested whether this parses data with a DOCTYPE successfully.
If it does, I don't think it would be difficult to update the check on
res_code so that it doesn't fail.  I'm also assuming libxml returns a
distinct code that separates this case from other errors, but I couldn't
find those codes quickly.  The current check is this:

if (res_code != 0 || xmlerrcxt->err_occurred)
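
A rough, untested sketch of how a relaxed version of the check above could
look, written as a standalone C function so the logic can be tried without
libxml.  Everything here is an assumption: HYPOTHETICAL_DOCTYPE_CODE is a
placeholder and not a real libxml constant, chunk_parse_failed is a made-up
name, and err_occurred stands in for xmlerrcxt->err_occurred.

```c
#include <stdbool.h>

/*
 * HYPOTHETICAL placeholder: stands in for whatever code libxml returns when
 * the chunk contains a DOCTYPE.  Not a real libxml constant; the actual
 * value (if a distinct one exists) still needs to be looked up.
 */
#define HYPOTHETICAL_DOCTYPE_CODE 9999

/*
 * Decide whether the parse result should be reported as an error.
 *   res_code:     return value of the chunk parser
 *   err_occurred: stand-in for xmlerrcxt->err_occurred
 */
static bool
chunk_parse_failed(int res_code, bool err_occurred)
{
    /* well-balanced chunk: fail only if an error was flagged separately */
    if (res_code == 0)
        return err_occurred;

    /* assumed DOCTYPE case: tolerate it, even though the flag may be set */
    if (res_code == HYPOTHETICAL_DOCTYPE_CODE)
        return false;

    /* -1 (args problem) or any other parser error code */
    return true;
}
```

One open question with this shape is that ignoring err_occurred in the
tolerated case could also hide unrelated errors raised while parsing the
same input, so the real check would probably need to be stricter.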

Does this sound reasonable?  Have I missed some major aspect?  If this is
on the right track I can work on creating a patch to move this forward.

Thanks,

*Ryan Lambert*
RustProof Labs
www.rustprooflabs.com
