[
https://issues.apache.org/jira/browse/DRILL-7954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373110#comment-17373110
]
Paul Rogers edited comment on DRILL-7954 at 7/2/21, 12:13 AM:
--------------------------------------------------------------
[~cgivre] provides a good overview of the issue. Unlike JSON, XML syntax
provides no hints of the expected structure of an element; Drill has to guess,
and has to make that guess looking ahead only one token. This was quite
difficult in JSON (where we at least have the {{[...]}} syntax), and is
intractable for "plain" XML.
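To make the lookahead problem concrete, here is a small sketch (Python's standard {{xml.etree.ElementTree}}, not Drill's actual reader; the helper name and sample documents are made up for illustration). It shows that a streaming parser sees exactly the same events for a scalar {{<field1>}} and for the first item of a repeated {{<field1>}}, so nothing at that point tells the reader which one it has:

```python
import io
import xml.etree.ElementTree as ET

def events_until_first_close(xml_text, tag):
    """Collect (event, tag) pairs up to the first closing `tag` element."""
    seen = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        seen.append((event, elem.tag))
        if event == "end" and elem.tag == tag:
            break
    return seen

# Hypothetical inputs: field1 appears once in the first, twice in the second.
scalar_doc = "<row><field1>a</field1></row>"
list_doc = "<row><field1>a</field1><field1>b</field1></row>"

scalar_events = events_until_first_close(scalar_doc, "field1")
list_events = events_until_first_close(list_doc, "field1")

# Both streams are identical up to this point; only later input differs.
print(scalar_events == list_events)  # True
```

In JSON the opening {{[}} would already have announced the list before the first value arrived.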
In addition to the solutions which Charles mentioned, one could create a custom
parser, one that knows that the {{<field1>}} element is a list. Of course,
rather than hand-coding each schema, it would be better to provide parameters
to a single parser, which is where the XML schema comes in.
One can also go the other way: as Charles noted, Drill has an (obscure)
provided schema feature which says the expected type of each column. This is a
bass-ackward way to specify a schema: if Drill knows that {{field1}} is a
{{REPEATED VARCHAR}}, then the parser can interpret {{<field1>}} as containing
a list of strings. There are obvious limits, but this is a place to start.
([~cgivre], does the XML parser support a provided schema?)
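As a rough sketch of the provided-schema idea (plain Python, not Drill code; the {{SCHEMA}} dictionary, the {{"repeated"}}/{{"scalar"}} labels and {{parse_row}} are hypothetical stand-ins for whatever the provided schema would actually supply), the parser consults the declared type instead of guessing from the input:

```python
import xml.etree.ElementTree as ET

# Hypothetical provided schema: column name -> "repeated" or "scalar".
SCHEMA = {"field1": "repeated", "field2": "scalar"}

def parse_row(xml_text, schema):
    """Build one row, using the schema to decide list vs. single value."""
    row = {}
    for child in ET.fromstring(xml_text):
        text = (child.text or "").strip()
        if schema.get(child.tag, "scalar") == "repeated":
            # Declared as a list: always collect into a list, even for one item.
            row.setdefault(child.tag, []).append(text)
        else:
            row[child.tag] = text
    return row

print(parse_row("<row><field1>a</field1><field2>b</field2></row>", SCHEMA))
# {'field1': ['a'], 'field2': 'b'}
```

The point is that a single {{<field1>}} still comes back as a one-element list, so the column type stays stable across files.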
Finally, one other choice is to use XML attributes to encode structure. I'm
pretty rusty on XML, but I believe there was some standard 20 years ago that
let you give the element type: {{<field1 type="list:string">}} or some such. We
used it heavily in a SOAP API back when dinosaurs roamed... The Drill XML
parser would have to understand the attributes, and your input would have to
include them.
If Drill were to support the XML schema description, it would be best to do so
at plan time, and compile the resulting parser outline into the execution plan.
This way, the (perhaps hundreds of) readers would not all have to do the same
schema downloading, parsing, translation and error reporting. The reader could
even generate Java code to implement the parser to avoid the slow and tedious
interpreter-based code otherwise required.
The bottom line is that, while Drill is "schema-free", that does not mean that
schemas are not needed (they are); it just means that Drill is not well suited
to data that needs a schema, such as XML.
> XML ability to not concatenate fields and attribute - change presentation of
> data
> ---------------------------------------------------------------------------------
>
> Key: DRILL-7954
> URL: https://issues.apache.org/jira/browse/DRILL-7954
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.19.0
> Reporter: benj
> Priority: Major
>
> With an XML file containing this data:
> {noformat}
> <a>
> <attr>
> <set num="0" val="1">x</set>
> <set num="1" val="2">y</set>
> </attr>
> <attr>
> <set num="2" val="a">z</set>
> <set num="3" val="b">a</set>
> </attr>
> </a>
> {noformat}
> {noformat}
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml',
> dataLevel=>1)) as x;
> +-----------------------------------------------+----------------+
> | attributes | attr |
> +-----------------------------------------------+----------------+
> | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} |
> +-----------------------------------------------+----------------+
> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2))
> as x;
> +---------------------------------+-----+
> | attributes | set |
> +---------------------------------+-----+
> | {"set_num":"01","set_val":"12"} | xy |
> | {"set_num":"23","set_val":"ab"} | za |
> +---------------------------------+-----+
> apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml',
> dataLevel=>3)) as x;
> +------------+
> | attributes |
> +------------+
> | {} |
> | {} |
> | {} |
> | {} |
> +------------+
> {noformat}
> Attributes and fields with the same name are concatenated and remain
> unusable _(maybe the possibility of adding a separator would help, but that is
> not the point here)_
> In fact, what we really need is the ability to obtain something like
> _(depending on the defined level)_ :
> {noformat}
> +---------------------------------------------------------------------------------------------------+
> | attr                                                                                              |
> +---------------------------------------------------------------------------------------------------+
> | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] |
> | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] |
> +---------------------------------------------------------------------------------------------------+
> +------------------------------------------------+
> | set |
> +------------------------------------------------+
> | {"set":"x","_attributes":{"num":"0","val":"1"}} |
> | {"set":"y","_attributes":{"num":"1","val":"2"}} |
> | {"set":"z","_attributes":{"num":"2","val":"a"}} |
> | {"set":"a","_attributes":{"num":"3","val":"b"}} |
> +------------------------------------------------+
> {noformat}
> _attributes fields could be generated at each level instead of being generated
> with the path from the top level => that would allow working with data from each
> level without losing information
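For reference, the output shape requested in the description can be sketched outside Drill with Python's standard library. This is only an illustration of the desired structure, not how the Drill XML reader works; {{set_record}}, {{rows_at_attr_level}} and {{rows_at_set_level}} are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

DOC = """<a>
  <attr>
    <set num="0" val="1">x</set>
    <set num="1" val="2">y</set>
  </attr>
  <attr>
    <set num="2" val="a">z</set>
    <set num="3" val="b">a</set>
  </attr>
</a>"""

def set_record(elem):
    """Keep the element text and its own attributes together, per element."""
    return {elem.tag: (elem.text or "").strip(), "_attributes": dict(elem.attrib)}

def rows_at_attr_level(xml_text):
    """One row per <attr>, each holding a list of <set> records."""
    root = ET.fromstring(xml_text)
    return [[set_record(s) for s in attr.findall("set")]
            for attr in root.findall("attr")]

def rows_at_set_level(xml_text):
    """One row per <set> element, in document order."""
    root = ET.fromstring(xml_text)
    return [set_record(s) for s in root.iter("set")]

for row in rows_at_set_level(DOC):
    print(row)
```

Because attributes stay attached to the element they came from, nothing is concatenated across sibling elements.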