[
https://issues.apache.org/jira/browse/SPARK-51426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon reassigned SPARK-51426:
------------------------------------
Assignee: Sebastian Bengtsson
> Setting metadata to empty dict does not work
> --------------------------------------------
>
> Key: SPARK-51426
> URL: https://issues.apache.org/jira/browse/SPARK-51426
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.5.0
> Environment: PySpark in Databricks.
> Databricks Runtime Version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
> Reporter: Sebastian Bengtsson
> Assignee: Sebastian Bengtsson
> Priority: Major
> Labels: pull-request-available
>
> It should be possible to remove column metadata from a DataFrame by setting
> the metadata to an empty dictionary. Surprisingly, passing an empty dict has
> no effect: the existing metadata is silently left in place.
> If column "a" has metadata set, the following has no effect:
> {code:python}
> df.withMetadata('a', {}){code}
> Expected: The metadata is removed/replaced by an empty dict.
> Actual: The metadata is still there, unaffected.
>
> Code to demonstrate this behavior:
> {code:python}
> from pyspark.sql.functions import col
>
> df = spark.createDataFrame([('',)], ['a'])
> print('no metadata:', df.schema['a'].metadata)
> df = df.withMetadata('a', {'foo': 'bar'})
> print('metadata has been set:', df.schema['a'].metadata)
> # Passing an empty dict via alias() does not clear the metadata:
> df = df.select([col('a').alias('a', metadata={})])
> print('metadata has not been removed:', df.schema['a'].metadata)
> # A non-empty dict does replace it:
> df = df.withMetadata('a', {'baz': 'burr'})
> print('metadata has been replaced:', df.schema['a'].metadata)
> # ...but an empty dict via withMetadata() is ignored as well:
> df = df.withMetadata('a', {})
> print('metadata still there:', df.schema['a'].metadata){code}
> {code:none}
> no metadata: {}
> metadata has been set: {'foo': 'bar'}
> metadata has not been removed: {'foo': 'bar'}
> metadata has been replaced: {'baz': 'burr'}
> metadata still there: {'baz': 'burr'}
> {code}
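>
> The root cause appears to be Python truthiness: an empty dict is falsy, so a
> guard like the {{if metadata:}} shown in the patch below cannot distinguish
> an explicitly passed empty dict from no metadata at all, and silently skips
> it. A minimal illustration of the pitfall, independent of Spark:
> {code:python}
> metadata = {}
> # An empty dict is falsy, so "if metadata:" treats it the same as None...
> print(bool(metadata))        # False
> # ...whereas an explicit None check keeps the two cases apart.
> print(metadata is not None)  # True
> {code}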
> Fixing this would include the following patch:
> {code:none}
> --- a/python/pyspark/sql/classic/column.py
> +++ b/python/pyspark/sql/classic/column.py
> @@ -518,7 +518,7 @@ class Column(ParentColumn):
>          sc = get_active_spark_context()
>          if len(alias) == 1:
> -            if metadata:
> +            if metadata is not None:
>                  assert sc._jvm is not None
>                  jmeta = getattr(sc._jvm, "org.apache.spark.sql.types.Metadata").fromJson(
>                      json.dumps(metadata)
> {code}
> But I suspect further changes on the Scala side of Spark are also required.
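> Until a fix is released, a possible workaround is to rebuild the DataFrame
> against a schema whose target field carries empty metadata. This is only a
> sketch: the helper name is hypothetical, and it assumes a classic
> (non-Connect) session where {{df.rdd}} is available.
> {code:python}
> from pyspark.sql.types import StructField, StructType
>
> # Hypothetical helper, for illustration only.
> def drop_column_metadata(df, name):
>     # Copy every field, replacing the target field's metadata with an empty dict.
>     fields = [
>         StructField(f.name, f.dataType, f.nullable, metadata={})
>         if f.name == name
>         else f
>         for f in df.schema.fields
>     ]
>     # Re-apply the rebuilt schema; the rows themselves are unchanged.
>     return df.sparkSession.createDataFrame(df.rdd, StructType(fields))
> {code}
> Round-tripping through the RDD is heavyweight, but it shows that the metadata
> lives entirely in the schema, not in the data.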
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]