[ https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421441#comment-16421441 ]
ASF GitHub Bot commented on NIFI-4185: -------------------------------------- Github user pvillard31 commented on a diff in the pull request: https://github.com/apache/nifi/pull/2587#discussion_r178437056 --- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/resources/docs/org.apache.nifi.xml.XMLReader/additionalDetails.html --- @@ -0,0 +1,378 @@ +<!DOCTYPE html> +<html lang="en"> + <!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + --> + <head> + <meta charset="utf-8"/> + <title>XMLReader</title> + <link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css"/> + </head> + + <body> + <p> + The XMLReader Controller Service reads XML content and creates Record objects. The Controller Service + must be configured with a schema that describes the structure of the XML data. Fields in the XML data + that are not defined in the schema will be skipped. 
+ </p> + <p> + Records are expected in the second level of the XML data, embedded within an enclosing root tag: + </p> + <code> + <pre> + <root> + <record> + <field1>content</field1> + <field2>content</field2> + </record> + <record> + <field1>content</field1> + <field2>content</field2> + </record> + </root> + </pre> + </code> + + <p> + For the following examples, it is assumed that the example records are enclosed by a root tag. + </p> + + <h2>Example 1: Simple Fields</h2> + + <p> + The simplest kind of data within XML data are tags / fields that contain only content (no attributes, no embedded tags). + They can be described in the schema by simple types (e.g. INT, STRING, ...). + </p> + + <code> + <pre> + <record> + <simple_field>content</simple_field> + </record> + </pre> + </code> + + <p> + This record can be described by a schema containing one field (e.g. of type string). Given this schema, + the reader expects zero or one occurrence of "simple_field" in the record. + </p> + + <code> + <pre> + { + "namespace": "nifi", + "name": "test", + "type": "record", + "fields": [ + { "name": "simple_field", "type": "string" } + ] + } + </pre> + </code> + + <h2>Example 2: Arrays with Simple Fields</h2> + + <p> + Arrays are represented by repeated tags / fields in XML data. For the following XML data, "array_field" is considered + to be an array enclosing simple fields, whereas "simple_field" is considered to be a simple field not enclosed in + an array. 
+ </p> + + <code> + <pre> + <record> + <array_field>content</array_field> + <array_field>content</array_field> + <simple_field>content</simple_field> + </record> + </pre> + </code> + + <p> + This record can be described by the following schema: + </p> + + <code> + <pre> + { + "namespace": "nifi", + "name": "test", + "type": "record", + "fields": [ + { "name": "array_field", "type": + { "type": "array", "items": "string" } + }, + { "name": "simple_field", "type": "string" } + ] + } + </pre> + </code> + + <p> + If a field in a schema is embedded in an array, the reader expects zero, one or more occurrences of the field + in a record. The field "array_field" could in principle also be defined as a simple field, but then the second occurrence + of this field would replace the first in the record object. Moreover, the field "simple_field" could also be defined + as an array. In this case, the reader would put it into the record object as an array with one element. + </p> + + <h2>Example 3: Tags with Attributes</h2> + + <p> + XML fields frequently contain not only content but also attributes. 
The following record contains a field with + an attribute "attr" and content: + </p> + + <code> + <pre> + <record> + <field_with_attribute attr="attr_content">content of field</field_with_attribute> + </record> + </pre> + </code> + + <p> + To parse the content of the field "field_with_attribute" together with the attribute "attr", two requirements have + to be fulfilled: + </p> + + <ul> + <li>In the schema, the field has to be defined as a record.</li> + <li>The property "Field Name for Content" has to be set.</li> + <li>As an option, the property "Attribute Prefix" can also be set.</li> + </ul> + + <p> + For the example above, the following property settings are assumed: + </p> + + <table> + <tr> + <th>Property Name</th> + <th>Property Value</th> + </tr> + <tr> + <td>Field Name for Content</td> + <td><code>field_name_for_content</code></td> + </tr> + <tr> + <td>Attribute Prefix</td> + <td><code>prefix_</code></td> + </tr> + </table> + + <p> + The schema can be defined as follows: + </p> + + <code> + <pre> + { + "name": "test", + "namespace": "nifi", + "type": "record", + "fields": [ + { + "name": "field_with_attribute", + "type": { + "name": "RecordForTag", + "type": "record", + "fields" : [ + {"name": "attr", "type": "string"}, + {"name": "field_name_for_content", "type": "string"} + ] + } + } + ] + } + </pre> + </code> + + <p> + Note that the field "field_name_for_content" not only has to be defined in the property section, but also in the + schema, whereas the prefix for attributes is not part of the schema. It will be prepended when an attribute named + "attr" is found at the respective position in the XML data and added to the record. 
The record object of the above + example will be structured as follows: + </p> + + <code> + <pre> + Record ( + Record "field_with_attribute" ( + RecordField "prefix_attr" = "attr_content", + RecordField "field_name_for_content" = "content of field" + ) + ) + </pre> + </code> + + <p> + In principle, the field "field_with_attribute" could also be defined as a simple field. In this case, the attributes + would simply be ignored. Vice versa, the simple field in example 1 above could also be defined as a record (assuming that + the property "Field Name for Content" is set). + </p> + + <h2>Example 4: Tags within tags</h2> + + <p> + XML data is frequently nested. In this case, tags enclose other tags: + </p> + <code> + <pre> + <record> + <field_with_embedded_fields attr="attr_content"> + <embedded_field>embedded content</embedded_field> + <another_embedded_field>another embedded content</another_embedded_field> + </field_with_embedded_fields> + </record> + </pre> + </code> + + <p> + The enclosing fields always have to be defined as records, irrespective of whether they include attributes to be + parsed or not. In this example, the tag "field_with_embedded_fields" encloses the fields "embedded_field" and + "another_embedded_field", which are both simple fields. The schema can be defined as follows: + </p> + + <code> + <pre> + { + "name": "test", + "namespace": "nifi", + "type": "record", + "fields": [ + { + "name": "field_with_embedded_fields", + "type": { + "name": "RecordForEmbedded", + "type": "record", + "fields" : [ + {"name": "attr", "type": "string"}, + {"name": "embedded_field", "type": "string"}, + {"name": "another_embedded_field", "type": "string"} + ] + } + } + ] + } + </pre> + </code> + + <p> + Notice that this case does not require the property "Field Name for Content" to be set as this is only required + for tags containing attributes and content. 
+ </p> + + <h2>Example 5: Array of records</h2> + + <p> + For further explanation of the logic of this reader, an example of an array of records shall be demonstrated. + The following record contains the field "array_element", which repeatedly occurs. The field contains two
--- End diff --

typo: ``contains the field "array_field"``

> Add XML record reader & writer services
> ---------------------------------------
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Affects Versions: 1.3.0
> Reporter: Andy LoPresto
> Assignee: Johannes Peter
> Priority: Major
> Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm,
> XML conversion has not yet been targeted. This will replace the previous
> ticket for XML to JSON conversion.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)