Giuseppe Ceravolo created SPARK-45414:
-----------------------------------------
Summary: spark-xml misplaces string tag content
Key: SPARK-45414
URL: https://issues.apache.org/jira/browse/SPARK-45414
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core
Affects Versions: 3.3.0
Reporter: Giuseppe Ceravolo
h1. Intro
Hi all! Please expect some degree of incompleteness in this issue as this is
the very first I post, and feel free to edit it as you like - I welcome your
feedback.
My goal is to provide you with as many details and indications as I can on this
issue that I am currently facing with a Client of mine on its Production
environment (we use Azure Databricks DBR 11.3 LTS).
I was told by [Sean Owen|[srowen (Sean Owen)
(github.com)|https://github.com/srowen]], who maintains the spark-xml maven
repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an
issue here because "This code has been ported to Apache Spark now anyway so
won't be updated here" (refer to his comment
[here|[https://github.com/databricks/spark-xml/issues/431#issuecomment-1744792958]).]
h1. Issue
When I write a DataFrame into xml format via the spark-xml library either (1) I
get an error if empty string columns are in between non-string nested ones or
(2) if I put all string columns at the end then I get a wrong xml where the
content of string tags are misplaced into the following ones.
h1. Code to reproduce the issue
Please find below the end-to-end code snippet that results into the error
h2. CASE (1): ERROR
When empty strings are in between non-string nested ones, the write fails.
Please find attached the full trace of the error.
fake_file_df = spark \
.sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS
`_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING,
_VALUE: STRING>) AS UPC, CAST('' AS STRING) AS _SerialNumberFlag,
CAST('MyDescription' AS STRING) AS Description,
CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID:
STRING, _Level: STRING>>) AS MerchandiseHierarchy,
CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS
ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
CAST('' AS STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry,
CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS
Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season:
STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name:
STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`,
'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>)
AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`,
ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE:
ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS
ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth:
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure:
STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>,
Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`,
'1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING,
TaxAmount: STRING>) AS TaxInformation, CAST('' AS STRING) AS
ItemImageUrl, CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS
`_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action:
STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
CAST('Add' AS STRING) AS _Action ;""" )#
fake_file_df.display()fake_file_df \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)
I noticed that it works if I try to write all columns up to "Color" (excluded),
namely:
fake_file_df \
.select( "ItemID", "UPC", "_SerialNumberFlag",
"Description", "MerchandiseHierarchy", "ItemPrice" ) \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)
h2. CASE (2): MISPLACED XML
When I put all string columns at the end of the 1-row DataFrame it mistakenly
writes the content of one column into the tag right after it.
fake_file_df = spark \
.sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS
`_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING,
_VALUE: STRING>) AS UPC, CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS
`_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS
MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`,
NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS
ItemPrice, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS
Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season:
STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name:
STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`,
'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>)
AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`,
ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE:
ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS
ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth:
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure:
STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>,
Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`,
'1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING,
TaxAmount: STRING>) AS TaxInformation, CAST(ARRAY(ARRAY(STRUCT(NULL
AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS
ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName:
STRING>>>) AS ItemFranchisees, CAST('' AS STRING) AS
_SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description,
CAST('' AS STRING) AS Color, CAST('' AS STRING) AS
IntendedIndustry, CAST('' AS STRING) AS ItemImageUrl,
CAST('Add' AS STRING) AS _Action ;""" )fake_file_df \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)
The output is a wrong xml where "MyDescription" is written inside the "Color"
tag instead of the "Description" tag (but if you display the "fake_file_df"
DataFrame it looks good as "MyDescription" is under the "Description" column
(see image below):).
<?xml version="1.0" encoding="UTF-8"?>
<root_tag>
<row_tag SerialNumberFlag="" Action="Add">
<ItemID Type="ItemId">123</ItemID>
<UPC Type="UPC">123</UPC>
<MerchandiseHierarchy ID="" Level=""/>
<ItemPrice ValueTypeCode="" Value=""/>
<Manufacturer>
<Name></Name>
</Manufacturer>
<Marketing>
<Season></Season>
</Marketing>
<BrandOwner Name=""/>
<ItemAttribute_culinary Name="Attribute1">
<AttributeValue>Value1</AttributeValue>
</ItemAttribute_culinary>
<ItemAttribute_noculinary Name="">
<item>
<AttributeCode></AttributeCode>
<AttributeValue></AttributeValue>
</item>
</ItemAttribute_noculinary>
<ItemMeasurements>
<Depth UnitOfMeasure=""></Depth>
<Height UnitOfMeasure=""></Height>
<Width UnitOfMeasure=""></Width>
<Diameter UnitOfMeasure=""></Diameter>
</ItemMeasurements>
<TaxInformation>
<TaxGroupID>GroupA</TaxGroupID>
<TaxExemptCode>CodeA</TaxExemptCode>
<TaxAmount>1</TaxAmount>
</TaxInformation>
<ItemFranchisees>
<item action="" franchiseeId="" franchiseeName=""/>
</ItemFranchisees>
<Description></Description>
<Color>{color:#FF0000}MyDescription{color}</Color>
<IntendedIndustry></IntendedIndustry>
<ItemImageUrl></ItemImageUrl>
</row_tag>
</root_tag>
Thanks! Giuseppe Ceravolo
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]