Charles Givre created DRILL-7823:
------------------------------------

             Summary: Add XML Format Plugin
                 Key: DRILL-7823
                 URL: https://issues.apache.org/jira/browse/DRILL-7823
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.17.0
            Reporter: Charles Givre
            Assignee: Charles Givre
             Fix For: 1.19.0


# XML Format Reader
This plugin enables Drill to read XML files without defining any kind of schema.

## Configuration
Aside from the file extension, there is one configuration option:

* `dataLevel`: XML data often contains a considerable amount of nesting which 
is not necessarily useful for data analysis. This parameter allows you to set 
the nesting level where the data actually starts. The levels start at `1`.

The default configuration is shown below:

```json
"xml": {
  "type": "xml",
  "extensions": [
    "xml"
  ],
  "dataLevel": 2
}
```
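
To make the idea of a nesting level concrete, the following Python sketch lists the element tags found at each level of a small document. It is only an illustration and not how the plugin is implemented: the sample document and the `tags_at_level` helper are hypothetical, and it assumes the root element counts as level `1`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<response>
  <books>
    <book><title>A</title></book>
    <book><title>B</title></book>
  </books>
</response>"""


def tags_at_level(xml_text, data_level):
    """Return the tags of all elements at the given nesting level.

    Assumption: the root element is level 1.
    """
    current = [ET.fromstring(xml_text)]
    # Descend one level at a time, collecting the children of every element.
    for _ in range(data_level - 1):
        current = [child for elem in current for child in elem]
    return [elem.tag for elem in current]


print(tags_at_level(SAMPLE, 1))  # ['response']
print(tags_at_level(SAMPLE, 2))  # ['books']
print(tags_at_level(SAMPLE, 3))  # ['book', 'book']
```

In this sample the wrapper elements `response` and `books` carry no data of their own, which is the kind of nesting the option is meant to skip.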

## Data Types
All fields are read as strings.  Nested fields are read as maps.  Future 
functionality could include support for lists.
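
The mapping described above can be mimicked in a few lines of Python. This is a rough sketch of the rule "leaves are strings, nested elements are maps", not the reader's actual code. The `element_to_value` helper is hypothetical, and how it handles repeated tags is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Leaf elements become strings; elements with children become dicts."""
    children = list(elem)
    if not children:
        return elem.text or ""
    # No list support in this sketch: a repeated child tag simply
    # overwrites the earlier value (an assumption, not Drill's behavior).
    return {child.tag: element_to_value(child) for child in children}


record = ET.fromstring(
    "<record>"
    "<intField>1</intField>"
    "<otherField><nestedField1>foo</nestedField1></otherField>"
    "</record>"
)
print(element_to_value(record))
# {'intField': '1', 'otherField': {'nestedField1': 'foo'}}
```

Note that `intField` comes back as the string `'1'` rather than an integer, mirroring the "all fields are read as strings" rule.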

## Limitations: Schema Ambiguity
XML is a challenging format to process as the structure does not give any hints 
about the schema.  For example, a JSON file might have the following record:

```json
"record" : {
  "intField" : 1,
  "listField" : [1, 2],
  "otherField" : {
    "nestedField1" : "foo",
    "nestedField2" : "bar"
  }
}
```

From this data, it is clear that `listField` is a `list` and `otherField` is a 
`map`.  This same data could be represented in XML as follows:

```xml
<record>
  <intField>1</intField>
  <listField>
    <value>1</value>
    <value>2</value>
  </listField>
  <otherField>
    <nestedField1>foo</nestedField1>
    <nestedField2>bar</nestedField2>
  </otherField>
</record>
```
Parsing this data poses no problem. But consider what would happen if we 
encountered the following first:
```xml
<record>
  <intField>1</intField>
  <listField>
    <value>2</value>
  </listField>
  <otherField>
    <nestedField1>foo</nestedField1>
    <nestedField2>bar</nestedField2>
  </otherField>
</record>
```
In this example, there is no way for Drill to know whether `listField` is a 
`list` or a `map` because it only has one entry. 
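
The ambiguity can be reproduced with a naive shape-inference heuristic. The Python sketch below is hypothetical and is not Drill's logic. It assumes that "repeated child tags mean a list", then applies that rule to the two `listField` elements from the examples above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_shape(elem):
    """Guess 'string', 'map', or 'list' from one element's children."""
    children = list(elem)
    if not children:
        return "string"
    counts = Counter(child.tag for child in children)
    # Heuristic (an assumption of this sketch): several children that all
    # share one tag are treated as a list.
    if len(counts) == 1 and len(children) > 1:
        return "list"
    return "map"


two_values = ET.fromstring(
    "<listField><value>1</value><value>2</value></listField>"
)
one_value = ET.fromstring("<listField><value>2</value></listField>")

print(infer_shape(two_values))  # list
print(infer_shape(one_value))   # map
```

The same field is classified differently depending on which record is seen first, so any schema inferred from the first record alone may be wrong for later ones.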

## Future Functionality

* **Build schema from XSD file or link**:  One of the major challenges of this 
reader is having to infer the schema of the data. XML files can reference a 
schema, although this is not required.  In the future, if there is interest, 
we can extend this reader to use an XSD file to build the schema, which will 
then be used to parse the actual XML file.
  
* **Infer Date Fields**: It may be possible to add the ability to infer date 
fields.

* **List Support**:  Future functionality may include the ability to infer 
lists from data structures.  
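
As a rough illustration of how an XSD could resolve the list ambiguity, the Python sketch below scans a schema for elements whose `maxOccurs` allows repetition. It is speculative: the sample XSD and the `repeated_elements` helper are hypothetical, and nothing here reflects a design the plugin has committed to.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C XML Schema language, in ElementTree's {uri}tag form.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema describing the listField element from the examples.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="listField">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="value" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def repeated_elements(xsd_text):
    """Return the names of elements the XSD allows to occur more than once."""
    names = set()
    for elem in ET.fromstring(xsd_text).iter(XS + "element"):
        # maxOccurs defaults to 1 when the attribute is absent.
        max_occurs = elem.get("maxOccurs", "1")
        if max_occurs == "unbounded" or int(max_occurs) > 1:
            names.add(elem.get("name"))
    return names


print(repeated_elements(XSD))  # {'value'}
```

With this information a reader could treat `value` as a repeated field even when a record happens to contain only one occurrence.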



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
