[jira] [Commented] (SPARK-45414) spark-xml misplaces string tag content

Ritika Maheshwari (Jira) Tue, 06 Feb 2024 22:17:04 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815121#comment-17815121
 ]


Ritika Maheshwari commented on SPARK-45414:
-------------------------------------------

@Sean Owen 

What is causing this issue is this piece of code 

 
{code:java}
com.databricks.spark.xml.parsers.StaxXMLGenerator  
def writeElement {....
val (names, values) = elements.unzip
val elementSchema = StructType(schema.filter(names.contains))
val elementRow = Row.fromSeq(row.toSeq.filter(values.contains))


{code}
 

In this code elements contains only elements, since attributes have already 
been handled before but row contains values for both attributes and elements. 
In this sql the attribute  "SerialNumberFlag" has value ''. But other elements 
like Color also has value ''. Therefore the line of code in red will also get 
the value for "SerialNumberFlag" where as the elementSchema will not have the 
schema for that attribute but instead would have schema for 
MerchandiseHierarchy. Hence MerchandiseHierarchy gets the value '' causing the 
issue.

Here are the values we get for elementSchema and elementRow which show the 
mismatch

 
{code:java}
24/02/06 22:05:38 INFO StaxXmlGenerator$: In apply elementSchema  is 
StructField(ItemID,StructType(StructField(_Type,StringType,true),StructField(_VALUE,StringType,true)),false),StructField(UPC,StructType(StructField(_Type,StringType,true),StructField(_VALUE,StringType,true)),false),StructField(Description,StringType,false),StructField(MerchandiseHierarchy,ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true),false),StructField(ItemPrice,ArrayType(StructType(StructField(_ValueTypeCode,StringType,true),StructField(_Value,StringType,true)),true),false),StructField(Color,StringType,false),StructField(IntendedIndustry,StringType,false),StructField(Manufacturer,StructType(StructField(Name,StringType,true)),false),StructField(Marketing,StructType(StructField(Season,StringType,true)),false),StructField(BrandOwner,StructType(StructField(_Name,StringType,true)),false),StructField(ItemAttribute_culinary,ArrayType(StructType(StructField(_Name,StringType,true),StructField(AttributeValue,StringType,true)),true),false),StructField(ItemAttribute_noculinary,ArrayType(StructType(StructField(_Name,StringType,true),StructField(_VALUE,ArrayType(ArrayType(StructType(StructField(AttributeCode,StringType,true),StructField(AttributeValue,StringType,true)),true),true),true)),true),false),StructField(ItemMeasurements,StructType(StructField(Depth,StructType(StructField(_UnitOfMeasure,StringType,true),StructField(_VALUE,StringType,true)),true),StructField(Height,StructType(StructField(_UnitOfMeasure,StringType,true),StructField(_VALUE,StringType,true)),true),StructField(Width,StructType(StructField(_UnitOfMeasure,StringType,true),StructField(_VALUE,StringType,true)),true),StructField(Diameter,StructType(StructField(_UnitOfMeasure,StringType,true),StructField(_VALUE,StringType,true)),true)),false),StructField(TaxInformation,StructType(StructField(TaxGroupID,StringType,true),StructField(TaxExemptCode,StringType,true),StructField(TaxAmount,StringType,true)),false),StructField(ItemImageUrl,StringType,false),StructField(ItemFranchisees,ArrayType(ArrayType(StructType(StructField(_action,StringType,true),StructField(_franchiseeId,StringType,true),StructField(_franchiseeName,StringType,true)),true),true),false)

24/02/06 22:05:38 INFO StaxXmlGenerator$:  In apply the elementRow is 
[ItemId,123],[UPC,123],MyDescription,,WrappedArray([null,null]),WrappedArray([null,null]),,,[null],[null],[null],WrappedArray([Attribute1,Value......
 


{code}

> spark-xml misplaces string tag content
> --------------------------------------
>
>                 Key: SPARK-45414
>                 URL: https://issues.apache.org/jira/browse/SPARK-45414
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Giuseppe Ceravolo
>            Priority: Critical
>         Attachments: IllegalArgumentException.txt, Screen Shot 2024-02-01 at 
> 7.46.29 PM.png
>
>
> h1. Intro
> Hi all! Please expect some degree of incompleteness in this issue as this is 
> the very first one I post, and feel free to edit it as you like - I welcome 
> your feedback.
> My goal is to provide you with as many details and indications as I can on 
> this issue that I am currently facing with a Client of mine on its Production 
> environment (we use Azure Databricks DBR 11.3 LTS).
> I was told by Sean Owen [[srowen (Sean Owen) 
> (github.com)|https://github.com/srowen]], who maintains the spark-xml maven 
> repository on GitHub [[https://github.com/srowen/spark-xml]] to post an issue 
> here because "This code has been ported to Apache Spark now anyway so won't 
> be updated here" (refer to his comment [here|#issuecomment-1744792958]).
> h1. Issue
> When I write a DataFrame into xml format via the spark-xml library either (1) 
> I get an error if empty string columns are in between non-string nested ones 
> or (2) if I put all string columns at the end then I get a wrong xml where 
> the content of string tags are misplaced into the following ones.
> h1. Code to reproduce the issue
> Please find below the end-to-end code snippet that results into the error
> h2. CASE (1): ERROR
> When empty strings are in between non-string nested ones, the write fails 
> with the following error.
> _Caused by: java.lang.IllegalArgumentException: Failed to convert value 
> MyDescription (class of class java.lang.String) in type 
> ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true)
>  to XML._
> Please find attached the full trace of the error.
> {code:python}
> fake_file_df = spark \
>     .sql(
>         """SELECT
>             CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
> STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
>             CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: 
> STRING, _VALUE: STRING>) AS UPC,
>             CAST('' AS STRING) AS _SerialNumberFlag,
>             CAST('MyDescription' AS STRING) AS Description,
>             CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS 
> ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
>             CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS 
> ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
>             CAST('' AS STRING) AS Color,
>             CAST('' AS STRING) AS IntendedIndustry,
>             CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
> Manufacturer,
>             CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS 
> Marketing,
>             CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS 
> BrandOwner,
>             CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) 
> AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
> ItemAttribute_culinary,
>             CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS 
> `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS 
> ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, 
> AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
>             CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS 
> `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, 
> STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
> `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: 
> STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: 
> STRING>>) AS ItemMeasurements,
>             CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
> '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
> TaxAmount: STRING>) AS TaxInformation,
>             CAST('' AS STRING) AS ItemImageUrl,
>             CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS 
> `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: 
> STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
>             CAST('Add' AS STRING) AS _Action
>         ;"""
>     )
> # fake_file_df.display()
> fake_file_df \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path) {code}
> I noticed that it works if I try to write all columns up to "Color" 
> (excluded), namely:
> {code:python}
> fake_file_df \
>     .select(
>         "ItemID",
>         "UPC",
>         "_SerialNumberFlag",
>         "Description",
>         "MerchandiseHierarchy",
>         "ItemPrice"
>     ) \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path){code}
> h2. CASE (2): MISPLACED XML
> When I put all string columns at the end of the 1-row DataFrame it mistakenly 
> writes the content of one column into the tag right after it.
> {code:python}
> fake_file_df = spark \
>     .sql(
>         """SELECT
>             CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
> STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
>             CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: 
> STRING, _VALUE: STRING>) AS UPC,
>             CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS 
> ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
>             CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS 
> ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
>             CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
> Manufacturer,
>             CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS 
> Marketing,
>             CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS 
> BrandOwner,
>             CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) 
> AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
> ItemAttribute_culinary,
>             CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS 
> `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS 
> ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, 
> AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
>             CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS 
> `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, 
> STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
> `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: 
> STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: 
> STRING>>) AS ItemMeasurements,
>             CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
> '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
> TaxAmount: STRING>) AS TaxInformation,
>             CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS 
> `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: 
> STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
>             CAST('' AS STRING) AS _SerialNumberFlag,
>             CAST('MyDescription' AS STRING) AS Description,
>             CAST('' AS STRING) AS Color,
>             CAST('' AS STRING) AS IntendedIndustry,
>             CAST('' AS STRING) AS ItemImageUrl,
>             CAST('Add' AS STRING) AS _Action
>         ;"""
>     )
> fake_file_df \
>     .coalesce(1) \
>     .write \
>     .format('com.databricks.spark.xml') \
>     .option('declaration', 'version="1.0" encoding="UTF-8"') \
>     .option("nullValue", "") \
>     .option('rootTag', "root_tag") \
>     .option('rowTag', "row_tag") \
>     .mode('overwrite') \
>     .save(xml_folder_path) {code}
> The output is a wrong xml where "MyDescription" is written inside the "Color" 
> tag instead of the "Description" tag (but if you display the "fake_file_df" 
> DataFrame it looks good as "MyDescription" is under the "Description" column).
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?>
> <root_tag>
>     <row_tag SerialNumberFlag="" Action="Add">
>         <ItemID Type="ItemId">123</ItemID>
>         <UPC Type="UPC">123</UPC>
>         <MerchandiseHierarchy ID="" Level=""/>
>         <ItemPrice ValueTypeCode="" Value=""/>
>         <Manufacturer>
>             <Name></Name>
>         </Manufacturer>
>         <Marketing>
>             <Season></Season>
>         </Marketing>
>         <BrandOwner Name=""/>
>         <ItemAttribute_culinary Name="Attribute1">
>             <AttributeValue>Value1</AttributeValue>
>         </ItemAttribute_culinary>
>         <ItemAttribute_noculinary Name="">
>             <item>
>                 <AttributeCode></AttributeCode>
>                 <AttributeValue></AttributeValue>
>             </item>
>         </ItemAttribute_noculinary>
>         <ItemMeasurements>
>             <Depth UnitOfMeasure=""></Depth>
>             <Height UnitOfMeasure=""></Height>
>             <Width UnitOfMeasure=""></Width>
>             <Diameter UnitOfMeasure=""></Diameter>
>         </ItemMeasurements>
>         <TaxInformation>
>             <TaxGroupID>GroupA</TaxGroupID>
>             <TaxExemptCode>CodeA</TaxExemptCode>
>             <TaxAmount>1</TaxAmount>
>         </TaxInformation>
>         <ItemFranchisees>
>             <item action="" franchiseeId="" franchiseeName=""/>
>         </ItemFranchisees>
>         <Description></Description>
>         <Color>MyDescription</Color>
>         <IntendedIndustry></IntendedIndustry>
>         <ItemImageUrl></ItemImageUrl>
>     </row_tag>
> </root_tag> {code}
> h1. Current workaround I put into Production
> As it looks like spark-xml is having a hard time when non-empty and empty 
> string columns are separated by non-string ones (e.g., a nested struct or 
> array column) I programmatically move all string columns at the end of the 
> DataFrame right before the write command executes.
> Not only that, I add a "fake" string column before each and every string 
> column ("Col1 AS FAKE_Col1") as it also looks like spark-xml is misplacing 
> ahead of 1 tag the content of string columns when writing the xml. And, of 
> course, I have to read back the xml file and get rid of all these "fake" tags 
> before I can feed it into the downward process.
>  
> Thanks!
> ~Giuseppe Ceravolo
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-45414) spark-xml misplaces string tag content

Reply via email to