shujingyang-db commented on PR #43319:
URL: https://github.com/apache/spark/pull/43319#issuecomment-1758122788

   @srowen Thanks for bringing up this case! In the case of
   ```
   <book>Great <foo>Book</foo> to read!</book>
   ```
   If `rowTag` is `book`, the resulting schema will be `foo STRING`, and the 
parser will skip the characters `Great` and ` to read!` in between the tags. 
This is because we only read values in between tags if the struct consists 
_only_ of attributes and the valueTag. Our definition of `valueTag` is: 
   > valueTag: The tag used for the value when there are attributes in the 
element having no child. Default is _VALUE.
   
   If there's another field, the value in between the tags will be ignored.
   
   If the user specifies the schema to be `foo STRING, _VALUE STRING`, we will 
also ignore the value and leave `_VALUE` empty. ([current 
behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198))
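
   To make the rule above concrete, here is a minimal Python sketch that models it with the standard library's `xml.etree.ElementTree`. It is only a toy illustration and not the `StaxXmlParser` code: `parse_row` and `apply_schema` are hypothetical helper names, and the `_` attribute prefix is assumed to match the default `attributePrefix`.

```python
import xml.etree.ElementTree as ET

VALUE_TAG = "_VALUE"      # default valueTag, as quoted above
ATTRIBUTE_PREFIX = "_"    # assumed default attributePrefix


def parse_row(xml_text, value_tag=VALUE_TAG):
    """Toy model: keep character data only if the element has no child elements."""
    elem = ET.fromstring(xml_text)
    # Attributes always become fields, prefixed to avoid clashing with child tags.
    row = {ATTRIBUTE_PREFIX + name: value for name, value in elem.attrib.items()}
    children = list(elem)
    if children:
        # Another field exists, so text before and after the child tags
        # (elem.text and child.tail) is dropped.
        for child in children:
            row[child.tag] = child.text
    else:
        # Only attributes (if any) plus a value: the text goes under valueTag.
        text = (elem.text or "").strip()
        if text:
            row[value_tag] = text
    return row


def apply_schema(row, field_names):
    """Toy model of a user-specified schema: fields that were not parsed stay None."""
    return {name: row.get(name) for name in field_names}


mixed = "<book>Great <foo>Book</foo> to read!</book>"
print(parse_row(mixed))                             # {'foo': 'Book'}
print(parse_row('<book id="1">Great Book</book>'))  # {'_id': '1', '_VALUE': 'Great Book'}
print(apply_schema(parse_row(mixed), ["foo", "_VALUE"]))  # {'foo': 'Book', '_VALUE': None}
```

   The last call mirrors the `foo STRING, _VALUE STRING` case: the surrounding text is still ignored, so `_VALUE` stays empty.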


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

