ravwojdyla opened a new pull request, #36430: URL: https://github.com/apache/spark/pull/36430
Almost copy-pasting from https://issues.apache.org/jira/browse/SPARK-38904:

This PR is related to https://stackoverflow.com/questions/71610435.

Let's assume I have a PySpark DataFrame with a certain schema, and I would like to select/overwrite that schema with a new schema that I *know* is compatible. I could do:

```python
from pyspark.sql import DataFrame

df: DataFrame
new_schema = ...
# Round-trip through the RDD API to apply the new schema
df.rdd.toDF(schema=new_schema)
```

Unfortunately, this triggers computation, as described in https://stackoverflow.com/questions/37088484/whats-the-performance-impact-of-converting-between-dataframe-rdd-and-back/37090151#37090151.

Note:

* the schema can be arbitrarily complicated (nested, etc.)
* the new schema includes updates to description, nullability, and additional metadata

See the POC of a workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj

### What changes were proposed in this pull request?

* add `DataFrame.select(schema)`

### Why are the changes needed?

* add `DataFrame.select(schema)`

### Does this PR introduce _any_ user-facing change?

* yes, `DataFrame.select(schema)`
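
The description above calls the new schema "compatible" without defining the term. As a rough sketch of one possible reading (an assumption, not something this PR specifies), two schemas could be treated as compatible when they differ only in nullability flags and metadata. The helper names `strip_annotations` and `is_compatible` are hypothetical, and the dicts are assumed to follow the shape returned by `StructType.jsonValue()`:

```python
# Keys that carry nullability or annotations rather than the data layout.
# Treating exactly these keys as ignorable is an assumption of this sketch.
IGNORED_KEYS = {"nullable", "metadata", "containsNull", "valueContainsNull"}


def strip_annotations(node):
    """Recursively drop nullability flags and metadata from a schema JSON value."""
    if isinstance(node, dict):
        return {
            key: strip_annotations(value)
            for key, value in node.items()
            if key not in IGNORED_KEYS
        }
    if isinstance(node, list):
        return [strip_annotations(item) for item in node]
    return node


def is_compatible(old_schema, new_schema):
    """Return True when the schemas match once annotations are ignored."""
    return strip_annotations(old_schema) == strip_annotations(new_schema)


old_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
}

# Same layout, but with tighter nullability and a description in the metadata.
new_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "id",
            "type": "long",
            "nullable": False,
            "metadata": {"description": "primary key"},
        },
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": True,
            "metadata": {},
        },
    ],
}

# Changing a field's type alters the layout, so this one should be rejected.
incompatible_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        old_schema["fields"][1],
    ],
}

print(is_compatible(old_schema, new_schema))
print(is_compatible(old_schema, incompatible_schema))
```

A check like this only compares schema descriptions and never touches the data, so it could run before applying the new schema without triggering computation. It does not verify that tightened nullability actually holds for the rows.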
