[jira] [Commented] (NIFI-7790) XML record reader - failure on well-formed XML

Pierre Villard (Jira) Tue, 27 Oct 2020 11:27:13 -0700


    [ https://issues.apache.org/jira/browse/NIFI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221645#comment-17221645 ]


Pierre Villard commented on NIFI-7790:
--------------------------------------

 
{noformat}
2020-10-27 19:18:07,954 ERROR [Event-Driven Process Thread-1] o.a.n.processors.standard.ConvertRecord ConvertRecord[id=a2eb82b4-ac5c-32f9-062a-194ed4057ecb] Failed to process StandardFlowFileRecord[uuid=080f5df4-a920-4a82-b99e-8776db105df7,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1603822546154-1, container=default, section=1], offset=2956, length=227],offset=0,name=080f5df4-a920-4a82-b99e-8776db105df7,size=227]; will route to failure: org.apache.avro.SchemaParseException: Can't redefine: org.apache.nifi.itemType
org.apache.avro.SchemaParseException: Can't redefine: org.apache.nifi.itemType
        at org.apache.avro.Schema$Names.put(Schema.java:1128)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
        at org.apache.avro.Schema.toString(Schema.java:324)
        at org.apache.avro.Schema.toString(Schema.java:314)
        at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:144)
        at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:135)
        at org.apache.nifi.avro.WriteAvroResultWithSchema.<init>(WriteAvroResultWithSchema.java:45)
        at org.apache.nifi.avro.AvroRecordSetWriter.createWriter(AvroRecordSetWriter.java:149)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:254)
        at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:105)
        at com.sun.proxy.$Proxy376.createWriter(Unknown Source)
        at org.apache.nifi.processors.standard.AbstractRecordProcessor$1.process(AbstractRecordProcessor.java:150)
        at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2988)
        at org.apache.nifi.controller.repository.BatchingSessionFactory$HighThroughputSession.write(BatchingSessionFactory.java:222)
        at org.apache.nifi.processors.standard.AbstractRecordProcessor.onTrigger(AbstractRecordProcessor.java:122)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1174)
        at org.apache.nifi.controller.scheduling.EventDrivenSchedulingAgent$EventDrivenTask.trigger(EventDrivenSchedulingAgent.java:354)
        at org.apache.nifi.controller.scheduling.EventDrivenSchedulingAgent$EventDrivenTask.run(EventDrivenSchedulingAgent.java:233)
        at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
When both <item> elements contain only <name>, schema inference generates this
schema:

{noformat}
{
   "type":"record",
   "name":"nifiRecord",
   "namespace":"org.apache.nifi",
   "fields":[
      {
         "name":"authors",
         "type":[
            "null",
            {
               "type":"record",
               "name":"authorsType",
               "fields":[
                  {
                     "name":"item",
                     "type":[
                        "null",
                        {
                           "type":"record",
                           "name":"itemType",
                           "fields":[
                              {
                                 "name":"name",
                                 "type":[
                                    "null",
                                    "string"
                                 ]
                              }
                           ]
                        }
                     ]
                  }
               ]
            }
         ]
      },
      {
         "name":"editors",
         "type":[
            "null",
            {
               "type":"record",
               "name":"editorsType",
               "fields":[
                  {
                     "name":"item",
                     "type":[
                        "null",
                        "itemType"
                     ]
                  }
               ]
            }
         ]
      }
   ]
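}
{noformat}
When the two <item> elements differ (<name> under authors, <commercialName>
under editors), the inferred "item" record is no longer the same in both
places. The current inference does not support this, because both records are
given the same Avro name ("itemType") and Avro refuses to redefine a named
type. I would recommend providing an explicit schema instead of relying on
schema inference. The schema below, which gives the two records distinct
names, would work with that data:
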
{noformat}
{
   "type":"record",
   "name":"nifiRecord",
   "namespace":"org.apache.nifi",
   "fields":[
      {
         "name":"authors",
         "type":[
            "null",
            {
               "type":"record",
               "name":"authorsType",
               "fields":[
                  {
                     "name":"item",
                     "type":[
                        "null",
                        {
                           "type":"record",
                           "name":"itemType1",
                           "fields":[
                              {
                                 "name":"name",
                                 "type":[
                                    "null",
                                    "string"
                                 ]
                              }
                           ]
                        }
                     ]
                  }
               ]
            }
         ]
      },
      {
         "name":"editors",
         "type":[
            "null",
            {
               "type":"record",
               "name":"editorsType",
               "fields":[
                  {
                     "name":"item",
                     "type":[
                        "null",
                        {
                           "type":"record",
                           "name":"itemType2",
                           "fields":[
                              {
                                 "name":"commercialName",
                                 "type":[
                                    "null",
                                    "string"
                                 ]
                              }
                           ]
                        }
                     ]
                  }
               ]
            }
         ]
      }
   ]
}
{noformat}
We could definitely improve schema inference to support such cases, but overall
it is always better to provide an explicit schema.
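
To make the name clash concrete, here is a minimal sketch in plain Python (standard library only, no Avro dependency) that walks an Avro schema given as parsed JSON and reports every full name that is defined more than once with a different body. This roughly mirrors the situation behind the "Can't redefine" error in the stack trace; it is not Avro's or NiFi's actual implementation. The function name conflicting_names and the build helper are hypothetical and exist only for this illustration.

```python
import json

NAMED_TYPES = {"record", "enum", "fixed"}


def conflicting_names(schema):
    """Return the sorted full names that are defined more than once
    with different definitions in an Avro schema (parsed JSON)."""
    seen = {}          # full name -> canonical JSON of its first definition
    conflicts = set()

    def walk(node, namespace):
        if isinstance(node, list):
            # Union: visit every branch.
            for branch in node:
                walk(branch, namespace)
        elif isinstance(node, dict):
            kind = node.get("type")
            if kind in NAMED_TYPES:
                # Nested named types inherit the enclosing namespace.
                namespace = node.get("namespace", namespace)
                name = node["name"]
                if "." in name or not namespace:
                    full = name
                else:
                    full = namespace + "." + name
                body = json.dumps(node, sort_keys=True)
                if seen.setdefault(full, body) != body:
                    conflicts.add(full)
                for field in node.get("fields", []):
                    walk(field["type"], namespace)
            elif kind == "array":
                walk(node["items"], namespace)
            elif kind == "map":
                walk(node["values"], namespace)
            elif isinstance(kind, (dict, list)):
                walk(kind, namespace)
        # Plain strings are primitives or references to a named type.

    walk(schema, None)
    return sorted(conflicts)


def build(author_item_name, editor_item_name):
    """Hypothetical helper: build the authors/editors schema with the
    given record names for the two "item" records."""
    def item(name, leaf):
        return {"type": "record", "name": name,
                "fields": [{"name": leaf, "type": ["null", "string"]}]}

    def parent(field, item_record):
        return {"name": field, "type": ["null", {
            "type": "record", "name": field + "Type",
            "fields": [{"name": "item", "type": ["null", item_record]}]}]}

    return {"type": "record", "name": "nifiRecord",
            "namespace": "org.apache.nifi",
            "fields": [parent("authors", item(author_item_name, "name")),
                       parent("editors", item(editor_item_name, "commercialName"))]}


clashing = build("itemType", "itemType")    # both records share one name
explicit = build("itemType1", "itemType2")  # distinct names, as in the schema above

print(conflicting_names(clashing))  # ['org.apache.nifi.itemType']
print(conflicting_names(explicit))  # []
```

Running a check like this against a hand-written schema before configuring it on the reader is one way to confirm that every record name is unique within its namespace.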

 

> XML record reader - failure on well-formed XML
> ----------------------------------------------
>
>                 Key: NIFI-7790
>                 URL: https://issues.apache.org/jira/browse/NIFI-7790
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.11.4
>            Reporter: Pierre Gramme
>            Priority: Major
>              Labels: records, xml
>         Attachments: bug-parse-xml.xml
>
>
> I am using ConvertRecord to parse XML flowfiles to Avro, with the 
> Infer Schema strategy. Some input flowfiles are sent to the failure output 
> queue even though they are well-formed: 
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>       <authors>
>               <item>
>                       <name>Neil Gaiman</name>
>               </item>
>       </authors>
>       <editors>
>               <item>
>                       <commercialName>Hachette</commercialName>
>               </item>
>       </editors>
> </root>
> {code}
> Note the use of authors/item/name on one side, and 
> editors/item/commercialName on the other side.
> On the other hand, this gets correctly parsed: 
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>       <authors>
>               <item>
>                       <name>Neil Gaiman</name>
>               </item>
>       </authors>
>       <editors>
>               <item>
>                       <name>Hachette</name>
>               </item>
>       </editors>
> </root>
> {code}
> See the attached template for minimal reproducible example.
>  
> My interpretation is that the failure in the first case is due to 2 
> independent XML node types having the same name (<item> in this case) but 
> having different types and occurring in different parents with different 
> types. In the second case, both <item>'s actually have the same node type. I 
> didn't use any Schema Inference Cache, so both item types should be inferred 
> independently. 
> Since the first document is legal XML (an XSD could be written for it) and 
> can also be represented in Avro, its conversion shouldn't fail.
> I'll be happy to provide more details if needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
