nchammas opened a new pull request #26730: Document mergeSchema option directly in the Python API
URL: https://github.com/apache/spark/pull/26730
 
 
   I don't think this change merits a Jira, but if it's required I'm happy to 
file one.
   
   ### What changes were proposed in this pull request?
   
   This change properly documents the `mergeSchema` option directly in the 
Python APIs for reading Parquet data.
   
   ### Why are the changes needed?
   
   The docstring for `DataFrameReader.parquet()` documents `mergeSchema`, but the 
method signature does not accept it as a keyword argument. It seems like a 
simple oversight.
   
   Before this PR, you'd have to do this to use `mergeSchema`:
   
   ```python
   spark.read.option('mergeSchema', True).parquet('test-parquet').show()
   ```
   
   After this PR, you can use the option as (I believe) it was intended to be 
used:
   
   ```python
   spark.read.parquet('test-parquet', mergeSchema=True).show()
   ```
   
   ### Does this PR introduce any user-facing change?
   
   Yes, this PR changes the signatures of `DataFrameReader.parquet()` and 
`DataStreamReader.parquet()` to match their docstrings.
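
One way to picture the mismatch between docstring and signature is a small check that lists parameters a docstring documents but the function does not accept. This is only a sketch: `documented_but_not_accepted`, `parquet_before`, and `parquet_after` are hypothetical names made up for illustration and are not PySpark code, and the check assumes reST-style `:param name:` docstring fields.

```python
import inspect
import re


def documented_but_not_accepted(func):
    """Return parameter names the docstring documents but the signature lacks."""
    doc = inspect.getdoc(func) or ""
    documented = re.findall(r":param\s+(\w+):", doc)
    accepted = set(inspect.signature(func).parameters)
    return [name for name in documented if name not in accepted]


# Hypothetical stand-ins for a reader method before and after the change.
def parquet_before(*paths):
    """Load Parquet files.

    :param mergeSchema: whether to merge schemas collected from all part-files.
    """


def parquet_after(*paths, mergeSchema=None):
    """Load Parquet files.

    :param mergeSchema: whether to merge schemas collected from all part-files.
    """


print(documented_but_not_accepted(parquet_before))  # ['mergeSchema']
print(documented_but_not_accepted(parquet_after))   # []
```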
   
   ### How was this patch tested?
   
   Testing the `mergeSchema` option directly seems to be left to the Scala side 
of the codebase. I tested my change manually to confirm the API works.
   
   I also confirmed that a session-level `spark.sql.parquet.mergeSchema` setting 
is not overridden when `mergeSchema` is left at its default in the call to 
`parquet()`:
   
   ```
   >>> spark.conf.set('spark.sql.parquet.mergeSchema', True)
   >>> spark.range(3).write.parquet('test-parquet/id')
   >>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')
   >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show()
   +----+----+
   |  id|name|
   +----+----+
   |null|   1|
   |null|   2|
   |null|   0|
   |   1|null|
   |   2|null|
   |   0|null|
   +----+----+
   >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show()
   +----+
   |  id|
   +----+
   |null|
   |null|
   |null|
   |   1|
   |   2|
   |   0|
   +----+
   ```
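
The session above is consistent with a reader that forwards the keyword only when the caller sets it, so a default of `None` leaves the session-level setting in effect. A minimal sketch of that idea follows. `SketchReader` and `effective_merge_schema` are hypothetical names invented here, the fallback value of `False` is an assumption, and none of this is PySpark's actual implementation.

```python
class SketchReader:
    """Hypothetical stand-in for a reader; not PySpark's implementation."""

    def __init__(self, session_conf):
        self._session_conf = session_conf  # session-level settings
        self._options = {}                 # per-read options

    def option(self, key, value):
        self._options[key] = value
        return self

    def parquet(self, path, mergeSchema=None):
        # Forward the keyword only when the caller actually set it, so the
        # default (None) leaves any session-level setting in effect.
        if mergeSchema is not None:
            self.option("mergeSchema", mergeSchema)
        return self.effective_merge_schema()

    def effective_merge_schema(self):
        # A per-read option wins; otherwise fall back to the session setting.
        if "mergeSchema" in self._options:
            return self._options["mergeSchema"]
        return self._session_conf.get("spark.sql.parquet.mergeSchema", False)


conf = {"spark.sql.parquet.mergeSchema": True}
print(SketchReader(conf).parquet("test-parquet"))                     # True
print(SketchReader(conf).parquet("test-parquet", mergeSchema=False))  # False
```

Using `None` rather than `False` as the default is what lets "not specified" be distinguished from "explicitly disabled".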

