nchammas opened a new pull request #26730: Document mergeSchema option directly in the Python API URL: https://github.com/apache/spark/pull/26730 I don't think this change merits a Jira, but if it's required I'm happy to file one. ### What changes were proposed in this pull request? This change properly documents the `mergeSchema` option directly in the Python APIs for reading Parquet data. ### Why are the changes needed? The docstring for `DataFrameReader.parquet()` mentions `mergeSchema` but doesn't show it in the API. It seems like a simple oversight. Before this PR, you'd have to do this to use `mergeSchema`: ```python spark.read.option('mergeSchema', True).parquet('test-parquet').show() ``` After this PR, you can use the option as (I believe) it was intended to be used: ```python spark.read.parquet('test-parquet', mergeSchema=True).show() ``` ### Does this PR introduce any user-facing change? Yes, this PR changes the signatures of `DataFrameReader.parquet()` and `DataStreamReader.parquet()` to match their docstrings. ### How was this patch tested? Testing the `mergeSchema` option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works. I also confirmed that setting `spark.sql.parquet.mergeSchema` at the session does not get overridden by leaving `mergeSchema` at its default when calling `parquet()`: ``` >>> spark.conf.set('spark.sql.parquet.mergeSchema', True) >>> spark.range(3).write.parquet('test-parquet/id') >>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name') >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show() +----+----+ | id|name| +----+----+ |null| 1| |null| 2| |null| 0| | 1|null| | 2|null| | 0|null| +----+----+ >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show() +----+ | id| +----+ |null| |null| |null| | 1| | 2| | 0| +----+ ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
