[PR] [SPARK-48322][SQL][CONNECT][PYTHON] Drop internal metadata in `DataFrame.schema` [spark]

via GitHub Fri, 17 May 2024 02:22:28 -0700


zhengruifeng opened a new pull request, #46636:
URL: https://github.com/apache/spark/pull/46636


   ### What changes were proposed in this pull request?
   Drop internal metadata in `DataFrame.schema`
   
   
   ### Why are the changes needed?
   Internal metadata can leak into `DataFrame.schema` in both Spark Connect and Spark Classic.
   
   For example, in Spark Classic:
   ```
   In [9]: spark.range(10).select(sf.lit(1).alias("key"), "id").groupBy("key").agg(sf.max("id")).schema.json()
   Out[9]: 
   '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)","nullable":true,"type":"long"}],"type":"struct"}'
   ```
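
To illustrate what "dropping internal metadata" could mean for the output above, here is a minimal sketch that works on the schema JSON alone, without a Spark session. It is not the implementation in this PR: the helper name `drop_internal_metadata` is hypothetical, and treating every metadata key that starts with `__` as internal is an assumption based only on the `__autoGeneratedAlias` key shown above.

```python
import json

# Schema JSON copied from the Spark Classic example above.
SCHEMA_JSON = (
    '{"fields":[{"metadata":{},"name":"key","nullable":false,"type":"integer"},'
    '{"metadata":{"__autoGeneratedAlias":"true"},"name":"max(id)",'
    '"nullable":true,"type":"long"}],"type":"struct"}'
)

# Assumption (not confirmed by the PR): internal keys are the ones
# whose names start with a double underscore.
INTERNAL_PREFIX = "__"


def drop_internal_metadata(node):
    """Hypothetical helper: return a copy of a parsed schema in which every
    "metadata" dict has had its INTERNAL_PREFIX keys removed."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key == "metadata" and isinstance(value, dict):
                # Keep user-supplied metadata, drop only the internal keys.
                cleaned[key] = {
                    k: v for k, v in value.items()
                    if not k.startswith(INTERNAL_PREFIX)
                }
            else:
                # Recurse so nested struct, array, and map types are covered.
                cleaned[key] = drop_internal_metadata(value)
        return cleaned
    if isinstance(node, list):
        return [drop_internal_metadata(item) for item in node]
    return node


cleaned = drop_internal_metadata(json.loads(SCHEMA_JSON))
print(json.dumps(cleaned, separators=(",", ":"), sort_keys=True))
```

With this sketch the `max(id)` field keeps its name, type, and nullability, and its `metadata` becomes an empty object, so schema comparisons no longer depend on whether the alias was auto-generated.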
   
   What makes it worse is that internal metadata may be leaked in different cases, so additional `_drop_meta` calls are needed in the Pandas APIs to make assertions work.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


