[ 
https://issues.apache.org/jira/browse/DRILL-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Vysotskyi updated DRILL-7823:
----------------------------------
    Labels: ready-to-commit  (was: )

> Add XML Format Plugin
> ---------------------
>
>                 Key: DRILL-7823
>                 URL: https://issues.apache.org/jira/browse/DRILL-7823
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.19.0
>
>
> # XML Format Reader
> This plugin enables Drill to read XML files without defining any kind of 
> schema.
> ## Configuration
> Aside from the file extension, there is one configuration option:
> * `dataLevel`: XML data often contains a considerable amount of nesting which 
> is not necessarily useful for data analysis. This parameter allows you to set 
> the nesting level where the data actually starts. The levels start at `1`.
> The default configuration is shown below:
> ```json
> "xml": {
>   "type": "xml",
>   "extensions": [
>     "xml"
>   ],
>   "dataLevel": 2
> }
> ```
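To make the effect of `dataLevel` concrete, here is a minimal Python sketch (not the plugin's implementation) of one possible reading of the option: the root element counts as level 1, and every element found at `dataLevel` becomes one row. The helper names `elements_at_level` and `to_record` and the sample `<books>` document are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def elements_at_level(root, data_level):
    """Return the elements at the given nesting level (the root is level 1)."""
    current = [root]
    for _ in range(data_level - 1):
        # Descend one level: collect the children of every current element.
        current = [child for parent in current for child in parent]
    return current


def to_record(element):
    """Read leaf elements as strings and nested elements as dicts (maps)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Assumption: repeated child tags overwrite each other, since this
    # sketch has no list support.
    return {child.tag: to_record(child) for child in children}


# Hypothetical sample data: the useful records sit one level below the root.
SAMPLE = """
<books>
  <book><title>Drill</title><year>2021</year></book>
  <book><title>XML</title><year>2020</year></book>
</books>
""".strip()

root = ET.fromstring(SAMPLE)
rows = [to_record(e) for e in elements_at_level(root, 2)]
```

Under these assumptions, level 2 yields one row per `<book>`, while level 1 would yield a single row for the enclosing `<books>` wrapper.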
> ## Data Types
> All fields are read as strings.  Nested fields are read as maps.  Future 
> functionality could include support for lists.
> ## Limitations: Schema Ambiguity
> XML is a challenging format to process because its structure gives no 
> hints about the schema.  For example, a JSON file might have the following 
> record:
> ```json
> "record" : {
>   "intField" : 1,
>   "listField" : [1, 2],
>   "otherField" : {
>     "nestedField1" : "foo",
>     "nestedField2" : "bar"
>   }
> }
> ```
> From this data, it is clear that `listField` is a `list` and `otherField` is 
> a `map`.  This same data could be represented in XML as follows:
> ```xml
> <record>
>   <intField>1</intField>
>   <listField>
>     <value>1</value>
>     <value>2</value>
>   </listField>
>   <otherField>
>     <nestedField1>foo</nestedField1>
>     <nestedField2>bar</nestedField2>
>   </otherField>
> </record>
> ```
> There is no problem parsing this data. But consider what would happen if we 
> encountered the following first:
> ```xml
> <record>
>   <intField>1</intField>
>   <listField>
>     <value>2</value>
>   </listField>
>   <otherField>
>     <nestedField1>foo</nestedField1>
>     <nestedField2>bar</nestedField2>
>   </otherField>
> </record>
> ```
> In this example, there is no way for Drill to know whether `listField` is a 
> `list` or a `map` because it only has one entry. 
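The ambiguity can be reproduced with a short Python sketch. It assumes a naive inference rule (a repeated child tag means "list", otherwise "map"), which is an illustration only and not how the reader is implemented; `infer_kind` and `kind_of_list_field` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_kind(element):
    """Guess 'scalar', 'list', or 'map' from one element's children."""
    tags = Counter(child.tag for child in element)
    if not tags:
        return "scalar"
    # A repeated child tag is the only structural hint that this is a list.
    if any(count > 1 for count in tags.values()):
        return "list"
    return "map"


def kind_of_list_field(xml_text):
    """Parse a record and classify its listField element."""
    record = ET.fromstring(xml_text)
    return infer_kind(record.find("listField"))


# The two listField shapes from the records above.
TWO_VALUES = "<record><listField><value>1</value><value>2</value></listField></record>"
ONE_VALUE = "<record><listField><value>2</value></listField></record>"

first = kind_of_list_field(TWO_VALUES)   # two <value> children
second = kind_of_list_field(ONE_VALUE)   # a single <value> child
```

With this rule the same field is classified as a list in the first record and as a map in the second, so whichever record is read first would decide the schema.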
> ## Future Functionality
> * **Build schema from XSD file or link**:  One of the major challenges of 
> this reader is having to infer the schema of the data. XML files can provide a 
> schema, although this is not required.  In the future, if there is interest, 
> we can extend this reader to use an XSD file to build the schema which will 
> be used to parse the actual XML file.
>   
> * **Infer Date Fields**: It may be possible to add the ability to infer date 
> fields.
> * **List Support**:  Future functionality may include the ability to infer 
> lists from data structures.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
