Pierre Gramme created NIFI-7790:
-----------------------------------
Summary: XML record reader - failure on well-formed XML
Key: NIFI-7790
URL: https://issues.apache.org/jira/browse/NIFI-7790
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 1.11.4
Reporter: Pierre Gramme
Attachments: bug-parse-xml.xml
I am using ConvertRecord to parse XML flowfiles to Avro, with the Infer Schema
strategy. Some input flowfiles are sent to the failure output queue even
though they are well-formed, for example:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
  <authors>
    <item>
      <name>Neil Gaiman</name>
    </item>
  </authors>
  <editors>
    <item>
      <commercialName>Hachette</commercialName>
    </item>
  </editors>
</root>
{code}
Note that authors/item contains a name element, whereas editors/item contains
a commercialName element.
In contrast, the following document is parsed correctly:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
  <authors>
    <item>
      <name>Neil Gaiman</name>
    </item>
  </authors>
  <editors>
    <item>
      <name>Hachette</name>
    </item>
  </editors>
</root>
{code}
See the attached template for a minimal reproducible example.
My interpretation is that the first case fails because two independent XML
element types share the same name (<item> in this case) but have different
content and occur under parents of different types. In the second case, both
<item> elements have the same structure. I did not use a Schema Inference
Cache, so the two item types should be inferred independently.
Since the first document is legal XML (an XSD could be written for it) and can
also be represented in Avro, its conversion shouldn't fail.
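To back up the well-formedness claim independently of NiFi, here is a minimal
sketch that parses the failing document with the JDK's built-in DOM parser
(not NiFi's XML reader) and lists the child elements of each <item>, keyed by
its parent. The class name ItemShapes and the helper itemShapes are made up
for this illustration and are not part of NiFi.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only (not NiFi code).
public class ItemShapes {
    // The failing document from this report, on one logical line.
    static final String FAILING_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<root>"
        + "<authors><item><name>Neil Gaiman</name></item></authors>"
        + "<editors><item><commercialName>Hachette</commercialName></item></editors>"
        + "</root>";

    // Maps "parent/item" to the names of that item's child elements.
    // parse() throws a SAXException if the input is not well-formed.
    static Map<String, List<String>> itemShapes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> shapes = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String path = item.getParentNode().getNodeName() + "/" + item.getNodeName();
            List<String> children = shapes.computeIfAbsent(path, k -> new ArrayList<>());
            NodeList kids = item.getChildNodes();
            for (int j = 0; j < kids.getLength(); j++) {
                Node kid = kids.item(j);
                if (kid.getNodeType() == Node.ELEMENT_NODE) {
                    children.add(kid.getNodeName());
                }
            }
        }
        return shapes;
    }

    public static void main(String[] args) throws Exception {
        // prints {authors/item=[name], editors/item=[commercialName]}
        System.out.println(itemShapes(FAILING_XML));
    }
}
```

The document parses without error, and the two <item> elements come out with
different child names depending on their parent, which matches the
interpretation above: the name alone does not identify the element type.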
I'll be happy to provide more details if needed.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)