ravwojdyla opened a new pull request, #36430:
URL: https://github.com/apache/spark/pull/36430

   Mostly copied from https://issues.apache.org/jira/browse/SPARK-38904:
   
   This PR is related to https://stackoverflow.com/questions/71610435. Suppose 
I have a PySpark DataFrame with a certain schema, and I would like to 
select/overwrite that schema with a new schema that I *know* is compatible. I 
could do:
   
   ```python
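   from pyspark.sql import DataFrame

   df: DataFrame  # an existing DataFrame
   new_schema = ...  # a schema known to be compatible with df.schema
   
   # Round-trip through the RDD API to apply the new schema.
   df.rdd.toDF(schema=new_schema)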
   ```
   
   Unfortunately, this triggers computation, as described in 
https://stackoverflow.com/questions/37088484/whats-the-performance-impact-of-converting-between-dataframe-rdd-and-back/37090151#37090151.
   
   Note:
    * the schema can be arbitrarily complicated (nested, etc.)
    * the new schema includes updates to descriptions, nullability, and 
additional metadata
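
As a rough illustration of what "compatible" could mean in the notes above (this reading is an assumption, not something the PR defines): two schemas have the same field names and data types at every nesting level and differ only in nullability and metadata. The sketch below checks that on plain dicts laid out like the output of `StructType.jsonValue()`. `same_shape` is a hypothetical helper, not part of Spark, and the example fields are made up.

```python
def same_shape(old, new):
    """Return True if two schema dicts differ only in nullability/metadata."""
    # Primitive types are plain strings such as "string" or "long".
    if isinstance(old, str) or isinstance(new, str):
        return old == new
    if old["type"] != new["type"]:
        return False
    if old["type"] == "struct":
        # Compare fields positionally: same names, same (nested) types.
        old_fields, new_fields = old["fields"], new["fields"]
        return len(old_fields) == len(new_fields) and all(
            o["name"] == n["name"] and same_shape(o["type"], n["type"])
            for o, n in zip(old_fields, new_fields)
        )
    if old["type"] == "array":
        return same_shape(old["elementType"], new["elementType"])
    if old["type"] == "map":
        return same_shape(old["keyType"], new["keyType"]) and same_shape(
            old["valueType"], new["valueType"]
        )
    # Anything else (assumed rare here) must match exactly.
    return old == new


# Hypothetical schemas: the new one tightens nullability and adds metadata.
old_schema = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": True, "metadata": {}},
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": True}},
]}
new_schema = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False,
     "metadata": {"description": "primary key"}},
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": False}},
]}

print(same_shape(old_schema, new_schema))
```

A check like this only compares structure; it says nothing about whether the data actually satisfies a tightened nullability constraint.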
   
   See a proof-of-concept workaround/utility at 
https://github.com/ravwojdyla/spark-schema-utils
   
   Also posted to the mailing list at 
https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj
   
   ### What changes were proposed in this pull request?
    * add `DataFrame.select(schema)`
   
   ### Why are the changes needed?
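    * to select/overwrite a DataFrame's schema with a compatible one without 
the `df.rdd.toDF(schema=...)` round trip, which triggers computation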
   
   ### Does this PR introduce _any_ user-facing change?
    * yes, it adds `DataFrame.select(schema)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

