shujingyang-db commented on PR #43319: URL: https://github.com/apache/spark/pull/43319#issuecomment-1758122788
@srowen Thanks for bringing up this case! Given

```
<book>Great <foo>Book</foo> to read!</book>
```

if `rowTag` is `book`, the resulting schema will be `foo STRING`, and the characters `Great` and ` to read!` between the tags are skipped. This is because we only read the values between tags if the struct consists _only_ of attributes and the valueTag.

Our definition of `valueTag` is:

> valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE.

If there's another field, the value between the tags will be ignored. If the user specifies the schema as `foo STRING, _VALUE STRING`, we will also ignore the value and leave `_VALUE` empty. ([current behavior](https://github.com/shujingyang-db/spark/blob/8fd10a40641c831155ffd644e331f0b818f72700/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala#L198))
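
To make the rule above concrete, here is a minimal toy model in plain Python. It is not Spark's parser and does not call any Spark API. The function `parse_row`, its `fields` parameter, and the `_` attribute prefix are assumptions made for this sketch; only the `_VALUE` default and the three outcomes it reproduces come from the comment.

```python
import xml.etree.ElementTree as ET

VALUE_TAG = "_VALUE"      # default valueTag named in the comment
ATTRIBUTE_PREFIX = "_"    # assumed attribute prefix for this sketch


def parse_row(xml_text, fields=None):
    """Toy model of the described rule (hypothetical helper, not Spark code)."""
    elem = ET.fromstring(xml_text)
    # Attributes always become fields.
    row = {ATTRIBUTE_PREFIX + k: v for k, v in elem.attrib.items()}
    children = list(elem)
    if children:
        # Another field exists, so the text between the tags is dropped.
        for child in children:
            row[child.tag] = child.text
    else:
        # Only attributes (if any) plus character data: keep it under valueTag.
        row[VALUE_TAG] = elem.text
    if fields is not None:
        # A user-specified schema: fields with no parsed value stay empty (None).
        row = {name: row.get(name) for name in fields}
    return row


mixed_xml = "<book>Great <foo>Book</foo> to read!</book>"
inferred = parse_row(mixed_xml)
with_schema = parse_row(mixed_xml, fields=["foo", VALUE_TAG])
leaf = parse_row('<book id="1">Great Book</book>')
```

In this sketch `inferred` contains only `foo`, `with_schema` has `_VALUE` set to `None`, and `leaf` keeps its text under `_VALUE` because the element has an attribute and no child.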
