[ 
https://issues.apache.org/jira/browse/DRILL-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391657#comment-17391657
 ] 

ASF GitHub Bot commented on DRILL-7979:
---------------------------------------

cgivre commented on pull request #2283:
URL: https://github.com/apache/drill/pull/2283#issuecomment-891119762


   > I started out adding specific, implementation-level comments but I've 
paused that to back off and ask: is this really a _self-closing tag_ thing, or 
is the situation the same for _any empty element_ that also occurs as a parent 
element? In my tests on `master`, the problem is the same for either of the 
following, which I believe are also equivalent in the XML spec.
   > 
   > ```
   > <!-- self-closing -->
   > <foo/>
   > 
   > <!-- just empty -->
   > <foo></foo>
   > ```
   > 
   > If I've got the right end of the stick here then I suggest that we adjust all 
the naming to refer to the "empty element" case, rather than the "self-closing" 
case.
   > 
   > Next, following on from our comments on Jira and the idea of using maps 
for this case, what do you think of the following approach?
   > 
   > 1. When our first encounter with an element `foo` is empty, and therefore 
ambiguous in terms of type, we default to the non-leaf case and make it a map.
   > 2. For subsequent parent `foo` elements we return populated maps.  For 
subsequent empty `foo` elements we return empty maps.
   > 3. For subsequent leaf elements `<foo>bar</foo>`, which we would normally 
map to varchar but where we find that we've already got a map from step 1, we 
put the element value into the map under a hardcoded special key, e.g. `{ 
'__value__': 'bar' }`.
   > 
   > The above will also work in the case when the first element encountered is 
empty but has attributes `<foo a='b' />` while the element discarding logic in 
the present patch does not discard such elements. If you're not crazy about 
this it's no problem and I've probably got a couple more specific remarks to 
add on the implementation.
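
For illustration, the three steps proposed above could be sketched roughly as follows. This is a Python sketch using `xml.etree.ElementTree`, not Drill's Java reader. The names `read_field`, `to_map` and `is_empty` are hypothetical, and storing attributes as plain map entries is an assumption made for the example; only the `__value__` key comes from the comment itself.

```python
import xml.etree.ElementTree as ET

VALUE_KEY = "__value__"  # special key suggested in the comment above


def is_empty(elem):
    """True for <foo/> and <foo></foo>: no children and no text."""
    return len(elem) == 0 and not (elem.text or "").strip()


def to_map(elem):
    """Represent an element as a map (one nesting level only, for brevity)."""
    out = dict(elem.attrib)  # assumption: attributes become map entries
    for child in elem:
        out[child.tag] = (child.text or "").strip()
    text = (elem.text or "").strip()
    if text:
        # Step 3: a leaf value arriving where a map is expected
        out[VALUE_KEY] = text
    return out


def read_field(rows, name):
    """Read field `name` from each row, fixing its type at first encounter."""
    as_map = None  # undecided until the element is first seen
    values = []
    for row in rows:
        elem = row.find(name)
        if elem is None:
            values.append(None)
            continue
        if as_map is None:
            # Step 1: an empty first encounter defaults to a map
            as_map = is_empty(elem) or len(elem) > 0 or bool(elem.attrib)
        if as_map:
            values.append(to_map(elem))  # Step 2: populated or empty map
        else:
            # First encounter was a plain leaf: keep it as a string.
            # A later parent element is not handled by this sketch.
            values.append((elem.text or "").strip())
    return values
```

With rows containing `<foo/>`, `<foo><x>1</x></foo>`, `<foo>bar</foo>` and `<foo></foo>` in that order, every value for `foo` comes back as a map, so the type never changes mid-scan.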
   
   @dzamo Thanks for the response.  The real issue is that we don't know the 
schema as we're scanning the file, so we have to do the best we can.  With 
empty fields (self-closing or otherwise) we don't really know what they are 
until we see real data.  For instance, if we decide to make them an empty map, 
we'll get an error if the next record shows up as a scalar.  The current 
approach was to treat empty fields as scalars, which then causes issues if we 
encounter a map in the next row.
   You asked in another comment about perhaps treating all empty elements in 
the same manner.  There was a specific challenge with how the self-closing 
tags are handled, which is why I made this PR.  I'm actually working on another 
project to get the XML reader to download a provided schema (the XSD link), 
which would solve a lot of issues reading XML.
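
To make the ambiguity concrete, here is a small Python sketch, not Drill code. `SchemaChangeError`, `kind_of` and `scan` are hypothetical names, and the `empty_as` parameter stands for whichever guess the reader makes when it meets an empty element before seeing real data.

```python
import xml.etree.ElementTree as ET


class SchemaChangeError(Exception):
    """Hypothetical stand-in for a schema change exception."""


def kind_of(elem, empty_as):
    """Classify an element as 'map' or 'scalar'; empty ones use the guess."""
    if len(elem) > 0:
        return "map"
    if (elem.text or "").strip():
        return "scalar"
    return empty_as  # empty element: nothing in the data tells us the type


def scan(xml_rows, empty_as):
    """Scan successive values of one field; fail if its kind changes."""
    seen = None
    for xml in xml_rows:
        kind = kind_of(ET.fromstring(xml), empty_as)
        if seen is None:
            seen = kind
        elif kind != seen:
            raise SchemaChangeError(f"field was {seen}, now {kind}")
    return seen
```

Guessing `"scalar"` fails on `["<foo/>", "<foo><a>1</a></foo>"]`, while guessing `"map"` fails on `["<foo/>", "<foo>bar</foo>"]`. Neither default is safe without a schema known up front.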




> Self-Closing XML Tags Cause Schema Change Exceptions
> ----------------------------------------------------
>
>                 Key: DRILL-7979
>                 URL: https://issues.apache.org/jira/browse/DRILL-7979
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Other
>    Affects Versions: 1.19.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.20.0
>
>
> Self-closing XML tags are dealt with strangely by Java's streaming parser.  
> If you have data where one row contains a self-closing XML tag foo 
> (<foo/>) but in the next row `foo` contains a map or other nested field, 
> Drill will throw a schema change exception.  
> This proposed fix causes Drill to ignore self-closing tags unless they have 
> attributes, which allows data like this to be successfully queried.
>  
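
A rough Python sketch of the idea in the description, skipping empty elements unless they carry attributes. It is not the patch itself: `keep_field` and `read_row` are hypothetical names, and because `xml.etree.ElementTree` does not distinguish `<foo/>` from `<foo></foo>`, this sketch treats both forms as empty, which is an assumption that may differ from the actual fix.

```python
import xml.etree.ElementTree as ET


def keep_field(elem):
    """Keep an element if it has children, text, or attributes."""
    has_content = len(elem) > 0 or bool((elem.text or "").strip())
    return has_content or bool(elem.attrib)


def read_row(row):
    """Return the tags of the fields in `row` that survive the filter."""
    return [child.tag for child in row if keep_field(child)]
```

For a row such as `<row><foo/><bar a='b'/><baz>1</baz></row>`, the empty `foo` is dropped, so no type has to be guessed for it, while `bar` and `baz` are kept.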



