[GitHub] [drill] cgivre opened a new pull request #2129: DRILL-7823 - Add XML Format Plugin

GitBox Thu, 17 Dec 2020 13:17:02 -0800


cgivre opened a new pull request #2129:
URL: https://github.com/apache/drill/pull/2129



   # [DRILL-7823](https://issues.apache.org/jira/browse/DRILL-7823): Add XML Format Plugin
   
   ## Description
   
   Adds XML format plugin to Drill.
   
   ## Documentation
   # XML Format Reader
   This plugin enables Drill to read XML files without defining any kind of 
schema.
   
   ## Configuration
   Aside from the file extension, there is one configuration option:
   
   * `dataLevel`: XML data often contains a considerable amount of nesting 
that is not necessarily useful for data analysis. This parameter sets the 
nesting level at which the data actually starts. Levels start at `1`.
   
   The default configuration is shown below:
   
   ```json
   "xml": {
     "type": "xml",
     "extensions": [
       "xml"
     ],
     "dataLevel": 2
   }
   ```
   
   ## Data Types
   All fields are read as strings.  Nested fields are read as maps.  Future 
functionality could include support for lists.
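
To illustrate how `dataLevel` and the string/map typing interact, here is a minimal Python sketch. It is not the plugin's implementation (which is written in Java), and it rests on an assumption that the description does not confirm: the root element counts as level `1`, and each element found at `dataLevel` becomes one record. The helper names `element_to_value` and `records_at_level` are hypothetical.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Leaf elements become strings; elements with children become dicts (maps)."""
    children = list(elem)
    if not children:
        return elem.text or ""
    # Repeated child tags overwrite each other here, mirroring the lack of list support.
    return {child.tag: element_to_value(child) for child in children}


def records_at_level(xml_text, data_level):
    """Return one value per element found at the given nesting level.

    Assumption: the root element is level 1.
    """
    if data_level < 1:
        raise ValueError("levels start at 1")
    current = [ET.fromstring(xml_text)]
    # Descend one nesting level per iteration until data_level is reached.
    for _ in range(data_level - 1):
        current = [child for elem in current for child in elem]
    return [element_to_value(elem) for elem in current]


SAMPLE = """<books>
  <book><title>A</title><author><name>X</name></author></book>
  <book><title>B</title></book>
</books>"""

print(records_at_level(SAMPLE, 2))
```

Under that assumption, level `2` yields one record per `<book>` element, with `title` read as a string and `author` read as a map.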
   
   ## Limitations: Schema Ambiguity
   XML is a challenging format to process as the structure does not give any 
hints about the schema.  For example, a JSON file might have the following 
record:
   
   ```json
   "record" : {
     "intField" : 1,
     "listField" : [1, 2],
     "otherField" : {
       "nestedField1" : "foo",
       "nestedField2" : "bar"
     }
   }
   ```
   
   From this data, it is clear that `listField` is a list and `otherField` is 
a map.  The same data could be represented in XML as follows:
   
   ```xml
   <record>
     <intField>1</intField>
     <listField>
       <value>1</value>
       <value>2</value>
     </listField>
     <otherField>
       <nestedField1>foo</nestedField1>
       <nestedField2>bar</nestedField2>
     </otherField>
   </record>
   ```
   Drill has no problem parsing this data. But consider what would happen if 
it encountered the following record first:
   ```xml
   <record>
     <intField>1</intField>
     <listField>
       <value>2</value>
     </listField>
     <otherField>
       <nestedField1>foo</nestedField1>
       <nestedField2>bar</nestedField2>
     </otherField>
   </record>
   ```
   In this example, there is no way for Drill to know whether `listField` is a 
`list` or a `map` because it only has one entry. 
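
The ambiguity can be reproduced outside Drill. The sketch below uses Python's standard `xml.etree.ElementTree` and a hypothetical heuristic, `guess_shape`, that looks only at a single element, much as a schema-less reader must. It is an illustration of the problem, not the logic the plugin uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def guess_shape(elem):
    """Guess an element's shape from its children alone (hypothetical heuristic)."""
    children = list(elem)
    if not children:
        return "scalar"
    counts = Counter(child.tag for child in children)
    if any(n > 1 for n in counts.values()):
        # A repeated child tag is the only structural hint of a list.
        return "list"
    # One child per tag: a map and a one-element list look identical.
    return "ambiguous"


two_values = ET.fromstring("<listField><value>1</value><value>2</value></listField>")
one_value = ET.fromstring("<listField><value>2</value></listField>")
other = ET.fromstring(
    "<otherField><nestedField1>foo</nestedField1>"
    "<nestedField2>bar</nestedField2></otherField>"
)

print(guess_shape(two_values))  # list
print(guess_shape(one_value))   # ambiguous
print(guess_shape(other))       # ambiguous
```

The one-entry `listField` and the genuine map `otherField` receive the same verdict, which is why the reader cannot settle on a type from the first record alone.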
   
   ## Future Functionality
   
   * **Build schema from XSD file or link**:  One of the major challenges of 
this reader is having to infer the schema of the data. XML files can be 
accompanied by a schema (XSD), although this is not required.  In the future, 
if there is interest, we can extend this reader to use an XSD file to build 
the schema, which would then be used to parse the actual XML file.
     
   * **Infer Date Fields**: It may be possible to add the ability to infer date 
fields.
   
   * **List Support**:  Future functionality may include the ability to infer 
lists from data structures.  
   
   ## Testing
   The PR includes about 15 unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
