[ 
https://issues.apache.org/jira/browse/DRILL-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370696#comment-17370696
 ] 

Charles Givre commented on DRILL-7954:
--------------------------------------

[~benj641] Thanks for the JIRA. As the author of the XML plugin, let me explain 
a bit as this was an issue I encountered when I was developing the plugin.

If you take a look at the docs[1], you'll see at the bottom a section on known 
limitations and in that you'll see a bullet "List Support". This issue is 
actually describing that limitation.

Why is it a limitation?
 The issue is that Drill doesn't know the schema before we start reading the 
data. The secondary issue is that XML is ambiguous by nature.

Consider the data below:

{{<row>}}
{{  <field1>}}
{{    <foo>value1</foo>}}
{{  </field1>}}
{{</row>}}
{{<row>}}
{{  <field1>}}
{{    <foo>value2</foo>}}
{{    <foo>value3</foo>}}
{{  </field1>}}
{{</row>}}

In this case, Drill first sees the field foo, interprets it as a string, 
creates a memory vector for it, and all is well. By the time it reaches the 
second row, Drill has already established a memory vector for column foo that 
holds single strings. What we should have there is a list, but Drill writes the 
data into the existing vector anyway. When Drill sees the first foo, it has no 
way of knowing that later entries should be lists because, to quote 
[~paul-rogers], "Drill cannot predict the future".
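
To make the failure mode concrete, here is a toy Python model of a one-pass reader. It is not Drill's code: the function name read_rows_single_pass, the <data> wrapper element around the sample rows, and the rule "append a repeated value to the existing string" are all assumptions, chosen only to mirror the concatenated output shown in the issue.

```python
import io
import xml.etree.ElementTree as ET

# Sample rows from above, wrapped in a hypothetical <data> root so the text is well-formed XML.
XML = """<data>
<row><field1><foo>value1</foo></field1></row>
<row><field1><foo>value2</foo><foo>value3</foo></field1></row>
</data>"""


def read_rows_single_pass(xml_text):
    """Toy one-pass reader: a column's type is fixed the first time the column is seen."""
    column_types = {}  # column name -> type decided on first sighting
    rows = []
    current = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "row":
            current = {}
        elif event == "end" and elem.tag == "row":
            rows.append(current)
            current = None
        elif event == "end" and current is not None and len(elem) == 0:
            # Leaf element inside a row: the first sighting pins the column to a scalar string.
            column_types.setdefault(elem.tag, "scalar")
            # A repeat cannot turn the column into a list any more,
            # so the value is appended to the string already stored (assumed behaviour).
            current[elem.tag] = current.get(elem.tag, "") + (elem.text or "")
    return column_types, rows


if __name__ == "__main__":
    print(read_rows_single_pass(XML))
```

In this model the second row comes back with foo set to "value2value3": the reader committed to a scalar column before it had any evidence that foo could repeat.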

There are a few possible solutions:
 # Use an XSD as schema: This represents the best way of handling this case. 
Since XML documents frequently provide a schema in the form of an XSD link at 
the top, one option would be to have Drill automatically pull back the XSD 
document (and ideally cache it), use that to build the schema, and then parse 
the data accordingly.
 # Provide a schema file: The next-best approach would be to create a schema 
file and use it as a provided schema for the data. This functionality 
**should** be available in Drill, although I'm not sure that the XML plugin can 
read the provided schema.
 # Add the ability to interpret lists on the fly: This is arguably the 
most complicated option, and there are a lot of edge cases. The fundamental 
problem is that XML is ambiguous.
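
As a rough illustration of why a schema known up front removes the ambiguity, the Python sketch below reads a small XSD, collects the leaf elements whose maxOccurs allows repeats, and then builds lists for exactly those columns. Everything here is an assumption made for the example: the XSD itself, the <data> wrapper element, and the helper names repeated_leaf_names and read_rows_with_schema. It does not show how Drill or the XML plugin handle an XSD or a provided schema.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical XSD for the sample data; it is not taken from the plugin or the issue.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="data">
    <xs:complexType><xs:sequence>
      <xs:element name="row" maxOccurs="unbounded">
        <xs:complexType><xs:sequence>
          <xs:element name="field1">
            <xs:complexType><xs:sequence>
              <xs:element name="foo" type="xs:string" maxOccurs="unbounded"/>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

# Sample rows, wrapped in the hypothetical <data> root declared by the XSD above.
XML = """<data>
<row><field1><foo>value1</foo></field1></row>
<row><field1><foo>value2</foo><foo>value3</foo></field1></row>
</data>"""


def repeated_leaf_names(xsd_text):
    """Return names of typed (leaf) elements that may occur more than once."""
    names = set()
    for el in ET.fromstring(xsd_text).iter(XS + "element"):
        max_occurs = el.get("maxOccurs", "1")  # the XSD default for maxOccurs is 1
        is_leaf = el.get("type") is not None   # simplification: elements with a type attribute are leaves
        if is_leaf and (max_occurs == "unbounded" or int(max_occurs) > 1):
            names.add(el.get("name"))
    return names


def read_rows_with_schema(xml_text, list_columns):
    """Build one record per <row>; columns named in list_columns are always lists."""
    rows = []
    for row in ET.fromstring(xml_text).iter("row"):
        record = {}
        for elem in row.iter():
            if elem is row or len(elem) > 0:
                continue  # skip the row itself and container elements such as field1
            text = elem.text or ""
            if elem.tag in list_columns:
                record.setdefault(elem.tag, []).append(text)
            else:
                record[elem.tag] = text
        rows.append(record)
    return rows


if __name__ == "__main__":
    print(read_rows_with_schema(XML, repeated_leaf_names(XSD)))
```

Because the set of list columns is fixed before the first row is read, foo is a list in every row, including the first row where it holds a single value.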

 

[1]: [https://github.com/apache/drill/tree/master/contrib/format-xml]

> XML ability to not concatenate fields and attribute - change presentation of 
> data
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-7954
>                 URL: https://issues.apache.org/jira/browse/DRILL-7954
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.19.0
>            Reporter: benj
>            Priority: Major
>
> With a XML containing these data :
> {noformat}
> <a>
>   <attr>
>     <set num="0" val="1">x</set>
>     <set num="1" val="2">y</set>
>   </attr>
>   <attr>
>     <set num="2" val="a">z</set>
>     <set num="3" val="b">a</set>
>   </attr>
> </a>
> {noformat}
> {noformat}
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>1)) as x;
> +-----------------------------------------------+----------------+
> |                  attributes                   |      attr      |
> +-----------------------------------------------+----------------+
> | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
> +-----------------------------------------------+----------------+
> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2)) as x;
> +---------------------------------+-----+
> |           attributes            | set |
> +---------------------------------+-----+
> | {"set_num":"01","set_val":"12"} | xy  |
> | {"set_num":"23","set_val":"ab"} | za  |
> +---------------------------------+-----+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>3)) as x;
> +------------+
> | attributes |
> +------------+
> | {}         |
> | {}         |
> | {}         |
> | {}         |
> +------------+
> {noformat}
> Attributes and fields with the same name are concatenated, which leaves the 
> result unusable _(an option to add a separator might help, but that is not 
> the point here)_.
> What we really need is the ability to obtain something like the following 
> _(depending on the level defined)_:
> {noformat}
> +---------------------------------------------------------------------------------------------------+
> |                                               attr                                                |
> +---------------------------------------------------------------------------------------------------+
> | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
> | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
> +---------------------------------------------------------------------------------------------------+
> +------------------------------------------------+
> |                      set                       |
> +------------------------------------------------+
> | {"set":"x","_attributes":{"num":"0","val":"1"}} |
> | {"set":"y","_attributes":{"num":"1","val":"2"}} |
> | {"set":"z","_attributes":{"num":"2","val":"a"}} |
> | {"set":"a","_attributes":{"num":"3","val":"b"}} |
> +------------------------------------------------+
> {noformat}
> _attributes fields could be generated at each level instead of being built 
> from the path from the top level. That would make it possible to work with 
> the data at each level without losing information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
