[ 
https://issues.apache.org/jira/browse/NIFI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224872#comment-17224872
 ] 

Pierre Gramme commented on NIFI-7790:
-------------------------------------

Thanks for the detailed feedback!

I agree that providing a schema is definitely the most robust option. I was 
actually hoping to get a first version of the schema inferred from the XML 
records, which I would then refine manually.

This was just a minimal reproducible example. In my use case, I have an XSD 
schema for the input XML, but no Avro schema. The XSD is quite big and complex, 
involving enums, min/max values, abstract classes, etc. So manually converting 
it to an Avro schema seems like a bad option: time-consuming at first and hard 
to maintain later.

Comments under your 
[blog post|https://pierrevillard.com/2018/06/28/nifi-1-7-xml-reader-writer-and-forkrecord-processor/]
suggested that an XSD-based parser might be on its way. But after reading the 
comments in NIFI-4185, I don't think it is possible to specify the input schema 
as XSD, is it?

If not, I will investigate the following method, which uses JAXB to convert 
XSD -> Java classes -> Avro schema (code under the Apache 2 licence):

[https://github.com/mit-ll/xml-avro-converter/blob/master/doc/tutorial.md#conversion-of-xml-schemas-and-data]

Or would you suggest some other automated way of converting the XSD to Avro?
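
To make concrete what an automated XSD-to-Avro conversion involves for a single construct, here is a minimal sketch that maps named XSD enumerations to Avro enum schemas using only the Python standard library. It is not the JAXB-based method linked above and not anything NiFi provides; the XSD fragment, the type name `Binding`, and the helper `enums_to_avro` are hypothetical and exist only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema elements.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical XSD fragment: the type name and values are made up.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Binding">
    <xs:restriction base="xs:string">
      <xs:enumeration value="HARDCOVER"/>
      <xs:enumeration value="PAPERBACK"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
""".strip()


def enums_to_avro(xsd_text):
    """Return one Avro enum schema per named simpleType with enumeration facets."""
    root = ET.fromstring(xsd_text)
    schemas = []
    for simple_type in root.findall(f"{XS}simpleType"):
        # Collect the enumeration values in document order.
        symbols = [e.get("value") for e in simple_type.iter(f"{XS}enumeration")]
        if symbols:
            schemas.append(
                {"type": "enum", "name": simple_type.get("name"), "symbols": symbols}
            )
    return schemas


avro_enums = enums_to_avro(XSD)
print(json.dumps(avro_enums, indent=2))
```

Even this small case hides a mismatch: Avro enum symbols must be valid Avro names (letters, digits and underscores, not starting with a digit), whereas XSD enumeration values can be arbitrary strings, so a real converter would need a renaming rule. Min/max facets and abstract types have no direct Avro counterpart at all, which is part of why a maintained tool is preferable to a hand-written mapping.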

> XML record reader - failure on well-formed XML
> ----------------------------------------------
>
>                 Key: NIFI-7790
>                 URL: https://issues.apache.org/jira/browse/NIFI-7790
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.11.4
>            Reporter: Pierre Gramme
>            Priority: Major
>              Labels: records, xml
>         Attachments: bug-parse-xml.xml
>
>
> I am using ConvertRecord in order to parse XML flowfiles to Avro, with the 
> Infer Schema strategy. Some input flowfiles are sent to the failure output 
> queue whereas they are well-formed: 
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>       <authors>
>               <item>
>                       <name>Neil Gaiman</name>
>               </item>
>       </authors>
>       <editors>
>               <item>
>                       <commercialName>Hachette</commercialName>
>               </item>
>       </editors>
> </root>
> {code}
> Note the use of authors/item/name on one side, and 
> editors/item/commercialName on the other side.
> On the other hand, this gets correctly parsed: 
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>       <authors>
>               <item>
>                       <name>Neil Gaiman</name>
>               </item>
>       </authors>
>       <editors>
>               <item>
>                       <name>Hachette</name>
>               </item>
>       </editors>
> </root>
> {code}
> See the attached template for a minimal reproducible example.
>  
> My interpretation is that the failure in the first case is due to two 
> independent XML node types having the same name (<item> in this case) but 
> different content, and occurring in parents that themselves have different 
> types. In the second case, both <item> elements actually have the same node 
> type. I didn't use any Schema Inference Cache, so both item types should be 
> inferred independently. 
> Since the first document is legal XML (an XSD could be written for it) and 
> can also be represented in Avro, its conversion shouldn't fail.
> I'll be happy to provide more details if needed.
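
The interpretation above can be illustrated with a small sketch. It is not NiFi's schema inference code, and it does not claim to reproduce the actual failure; it only shows, under the assumption that record types are identified by element name alone, why the first document produces two conflicting definitions of `item`, while identifying them by full path does not. The helper names `record_shapes` and `conflicts` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The failing document from the report, without the XML declaration.
DOC = """<root>
  <authors><item><name>Neil Gaiman</name></item></authors>
  <editors><item><commercialName>Hachette</commercialName></item></editors>
</root>"""


def record_shapes(root, use_path):
    """Map each record key to the set of child-name tuples observed for it.

    With use_path=False the key is the bare element name; with use_path=True
    it is the full path from the document root.
    """
    shapes = defaultdict(set)

    def walk(elem, path):
        children = list(elem)
        if children:  # only elements with children are treated as records
            key = path if use_path else elem.tag
            shapes[key].add(tuple(sorted(child.tag for child in children)))
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return dict(shapes)


def conflicts(shapes):
    """Return the keys that were observed with more than one shape."""
    return sorted(key for key, seen in shapes.items() if len(seen) > 1)


doc_root = ET.fromstring(DOC)
by_name = record_shapes(doc_root, use_path=False)
by_path = record_shapes(doc_root, use_path=True)

print("conflicts when keyed by name:", conflicts(by_name))
print("conflicts when keyed by path:", conflicts(by_path))
```

Keyed by name, `item` is seen once with the field `name` and once with `commercialName`; keyed by path, `root/authors/item` and `root/editors/item` are distinct records with one shape each. Avro itself requires named types to have unique full names within a schema, so representing the first document in Avro needs the two `item` records to be given different names or namespaces.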



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
