[jira] [Comment Edited] (DRILL-7954) XML ability to not concatenate fields and attribute - change presentation of data

Paul Rogers (Jira) Thu, 01 Jul 2021 17:14:06 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373110#comment-17373110
 ]


Paul Rogers edited comment on DRILL-7954 at 7/2/21, 12:13 AM:
--------------------------------------------------------------

[~cgivre] provides a good overview of the issue. Unlike JSON, XML syntax 
provides no hints about the expected structure of an element; Drill has to 
guess, and has to make that guess looking ahead only one token. This was quite 
difficult in JSON (where we at least have the {{[...]}} syntax), and is 
intractable for "plain" XML. 
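
To make the lookahead problem concrete, here is a minimal sketch using Python's standard {{xml.etree.ElementTree}} streaming parser (not Drill's reader). The {{<row>}} wrapper and the two sample documents are made up for illustration. Up to the first closing {{</field1>}}, a scalar field and a list field produce exactly the same event stream:

```python
import io
import xml.etree.ElementTree as ET

def events_until_first_close(xml_text, tag):
    """Return the (event, tag) pairs a streaming parser sees up to the first </tag>."""
    seen = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        seen.append((event, elem.tag))
        if event == "end" and elem.tag == tag:
            break
    return seen

# Hypothetical inputs: field1 occurs once in the first document, twice in the second.
scalar_doc = "<row><field1>a</field1></row>"
list_doc = "<row><field1>a</field1><field1>b</field1></row>"

prefix_scalar = events_until_first_close(scalar_doc, "field1")
prefix_list = events_until_first_close(list_doc, "field1")

# Both prefixes are identical, so the parser cannot yet know which case it is in.
print(prefix_scalar == prefix_list)
```

Only the token after the first {{</field1>}} distinguishes the two cases, and by then the reader has already had to commit to a column type.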

In addition to the solutions which Charles mentioned, one could create a custom 
parser, one that knows that the {{<field1>}} element is a list. Of course, 
rather than hand-coding each schema, it would be better to provide parameters 
to a single parser, which is where the XML schema comes in.

One can also go the other way: as Charles noted, Drill has an (obscure) 
provided schema feature that states the expected type of each column. This is a 
bass-ackward way to specify a schema: if Drill knows that {{field1}} is a 
{{REPEATED VARCHAR}}, then the parser can interpret {{<field1>}} as containing 
a list of strings. There are obvious limits, but this is a place to start. 
([~cgivre], does the XML parser support a provided schema?)
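
A rough sketch of the idea, in Python rather than Drill's Java, and assuming a plain dictionary stands in for the provided schema (this is not Drill's provided-schema API; {{field2}} and the {{<row>}} wrapper are invented for the example). The parser consults the declared type instead of guessing:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a provided schema: column name -> type string.
PROVIDED_SCHEMA = {"field1": "REPEATED VARCHAR", "field2": "VARCHAR"}

def read_row(row_xml, schema):
    """Build one row, using the schema to decide list vs. scalar per element."""
    row = {}
    for child in ET.fromstring(row_xml):
        declared = schema.get(child.tag, "VARCHAR")  # undeclared columns default to scalar
        text = child.text or ""
        if declared.startswith("REPEATED"):
            row.setdefault(child.tag, []).append(text)
        else:
            row[child.tag] = text
    return row

one = read_row("<row><field1>a</field1><field2>x</field2></row>", PROVIDED_SCHEMA)
two = read_row("<row><field1>a</field1><field1>b</field1></row>", PROVIDED_SCHEMA)
print(one)
print(two)
```

With the schema in hand, a single {{<field1>}} still yields a one-element list, so the column type stays stable across rows and files.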

Finally, one other choice is to use XML attributes to encode structure. I'm 
pretty rusty on XML, but I believe there was some standard 20 years ago that 
let you give the element type: {{<field1 type="list:string">}} or some such. We 
used it heavily in a SOAP API back when dinosaurs roamed... The Drill XML 
parser would have to understand the attributes, and your input would have to 
include them.
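
As a sketch of what attribute-driven typing could look like (Python, standard library only): the {{type="list:string"}} syntax is the half-remembered form from above, and the {{int}} hint, the {{count}} element and the {{<row>}} wrapper are assumptions added for the example, not a real standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical base types that a type hint may name.
CONVERTERS = {"string": str, "int": int}

def read_typed_row(row_xml):
    """Build one row, letting each element's type attribute drive its shape."""
    row = {}
    for child in ET.fromstring(row_xml):
        hint = child.get("type", "string")  # no attribute means a scalar string
        is_list = hint.startswith("list:")
        base = hint[len("list:"):] if is_list else hint
        value = CONVERTERS[base](child.text or "")
        if is_list:
            row.setdefault(child.tag, []).append(value)
        else:
            row[child.tag] = value
    return row

row = read_typed_row(
    "<row>"
    '<field1 type="list:string">a</field1>'
    '<field1 type="list:string">b</field1>'
    '<count type="int">2</count>'
    "</row>"
)
print(row)
```

The cost is exactly the one noted above: every input file has to carry the hints.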

If Drill were to support the XML schema description, it would be best to do so 
at plan time, and compile the resulting parser outline into the execution plan. 
This way, the (perhaps hundreds of) readers would not all have to do the same 
schema downloading, parsing, translation, and error reporting. The reader could 
even generate Java code to implement the parser to avoid the slow and tedious 
interpreter-based code otherwise required.
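
The plan-time split can be sketched as follows (Python for brevity; the JSON plan fragment, {{compile_outline}}, {{build_plan}} and {{Reader}} are hypothetical names, not Drill classes). The expensive schema work runs once, and each reader only deserializes the result:

```python
import json

COMPILE_CALLS = 0  # counts how often the expensive plan-time step runs

def compile_outline(schema):
    """Plan-time step: turn a schema description into a flat parser outline."""
    global COMPILE_CALLS
    COMPILE_CALLS += 1
    outline = {}
    for name, type_string in schema.items():
        parts = type_string.split()
        outline[name] = {"repeated": parts[0] == "REPEATED", "type": parts[-1]}
    return outline

def build_plan(schema):
    """Embed the compiled outline in a (hypothetical) serialized plan fragment."""
    return json.dumps({"reader": "xml", "outline": compile_outline(schema)})

class Reader:
    """Execution-time reader: deserializes the outline and never recompiles it."""

    def __init__(self, plan_json):
        self.outline = json.loads(plan_json)["outline"]

    def is_list(self, column):
        return self.outline[column]["repeated"]

plan = build_plan({"field1": "REPEATED VARCHAR"})
readers = [Reader(plan) for _ in range(100)]
print(COMPILE_CALLS, readers[0].is_list("field1"))
```

Schema errors then surface once, at planning, rather than once per reader.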

The bottom line is that, while Drill is "schema-free", that does not mean that 
schemas are not needed (they are); it just means that Drill is not well suited 
to data that needs a schema, such as XML.


> XML ability to not concatenate fields and attribute - change presentation of 
> data
> ---------------------------------------------------------------------------------
>
>                 Key: DRILL-7954
>                 URL: https://issues.apache.org/jira/browse/DRILL-7954
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.19.0
>            Reporter: benj
>            Priority: Major
>
> With an XML file containing this data:
> {noformat}
> <a>
>   <attr>
>     <set num="0" val="1">x</set>
>     <set num="1" val="2">y</set>
>   </attr>
>   <attr>
>     <set num="2" val="a">z</set>
>     <set num="3" val="b">a</set>
>   </attr>
> </a>
> {noformat}
> {noformat}
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', 
> dataLevel=>1)) as x;
> +-----------------------------------------------+----------------+
> |                  attributes                   |      attr      |
> +-----------------------------------------------+----------------+
> | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
> +-----------------------------------------------+----------------+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', 
> dataLevel=>2)) as x;
> +---------------------------------+-----+
> |           attributes            | set |
> +---------------------------------+-----+
> | {"set_num":"01","set_val":"12"} | xy  |
> | {"set_num":"23","set_val":"ab"} | za  |
> +---------------------------------+-----+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', 
> dataLevel=>3)) as x;
> +------------+
> | attributes |
> +------------+
> | {}         |
> | {}         |
> | {}         |
> | {}         |
> +------------+
> {noformat}
> Attributes and fields with the same name are concatenated and become 
> unusable _(maybe the possibility of adding a separator would help, but that is 
> not the point here)_.
> In fact, what we really need is the ability to obtain something like the 
> following _(depending on the level specified)_:
> {noformat}
> +---------------------------------------------------------------------------------------------------+
> |                                               attr                                                |
> +---------------------------------------------------------------------------------------------------+
> | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
> | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
> +---------------------------------------------------------------------------------------------------+
> +-------------------------------------------------+
> |                       set                       |
> +-------------------------------------------------+
> | {"set":"x","_attributes":{"num":"0","val":"1"}} |
> | {"set":"y","_attributes":{"num":"1","val":"2"}} |
> | {"set":"z","_attributes":{"num":"2","val":"a"}} |
> | {"set":"a","_attributes":{"num":"3","val":"b"}} |
> +-------------------------------------------------+
> {noformat}
> _attributes fields could be generated at each level instead of being 
> generated with the path from the top level. That would allow working with the 
> data at each level without losing information.
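
The per-level layout requested in the issue description can be sketched outside Drill with Python's standard library. This only illustrates the target shape and is not the XML reader's implementation; {{element_to_record}} and {{records_per_attr}} are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<a>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</a>"""

def element_to_record(elem):
    """Keep an element's text and its own attributes together at the same level."""
    record = {elem.tag: (elem.text or "").strip()}
    if elem.attrib:
        record["_attributes"] = dict(elem.attrib)
    return record

def records_per_attr(xml_text):
    """One row per <attr>, each row a list of per-<set> records."""
    root = ET.fromstring(xml_text)
    return [[element_to_record(child) for child in attr] for attr in root]

rows = records_per_attr(SAMPLE)
for row in rows:
    print(row)
```

Because each record carries only its own attributes, no values are concatenated across sibling elements.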



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
