shujingyang-db commented on code in PR #44571:
URL: https://github.com/apache/spark/pull/44571#discussion_r1441349590
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala:
##########
@@ -288,16 +265,17 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
            case dt: DataType => dt
          }
          // Add the field and datatypes so that we can check if this is ArrayType.
-         val field = StaxXmlParserUtils.getName(e.asStartElement.getName, options)
          addOrUpdateType(nameToDataType, field, inferredType)
        case c: Characters if !c.isWhiteSpace =>
-         // This can be an attribute-only object
+         // This can be a value tag
          val valueTagType = inferFrom(c.getData)
          addOrUpdateType(nameToDataType, options.valueTag, valueTagType)
-       case _: EndElement =>
-         shouldStop = inferAndCheckEndElement(parser)
+       case e: EndElement =>
+         // In case of corrupt records, we shouldn't read beyond the EndDocument
+         shouldStop = parser.peek().isInstanceOf[EndDocument] ||
+           StaxXmlParserUtils.getName(e.getName, options) == startElementName
Review Comment:
We discussed offline and decided to simplify this to `shouldStop = true`. With
that change, we also ensure that the entire entry, including the starting tag,
value, and ending tag, is fully consumed when parsing completes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]