JoshRosen opened a new pull request, #45885:
URL: https://github.com/apache/spark/pull/45885

   ### What changes were proposed in this pull request?
   
   This PR deflakes the `pyspark.sql.dataframe.DataFrame.writeStream` doctest.
   
   PR https://github.com/apache/spark/pull/45298 aimed to fix that test but misdiagnosed the root cause. The problem is not that tests were colliding on a temporary directory. Rather, the issue is specific to the `DataFrame.writeStream` test's logic: the test starts a streaming query that writes files to the temporary directory, then exits the temporary directory context manager without first stopping the streaming query. This creates a race condition in which the context manager may delete the directory while the streaming query is still writing new files into it, leading to errors like the following:
   ```
   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in 
pyspark.sql.dataframe.DataFrame.writeStream
   Failed example:
       with tempfile.TemporaryDirectory() as d:
           # Create a table with Rate source.
           df.writeStream.toTable(
               "my_table", checkpointLocation=d)
   Exception raised:
       Traceback (most recent call last):
         File "/usr/lib/python3.11/doctest.py", line 1353, in __run
           exec(compile(example.source, filename, "single",
         File "<doctest pyspark.sql.dataframe.DataFrame.writeStream[3]>", line 
1, in <module>
           with tempfile.TemporaryDirectory() as d:
         File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__
           self.cleanup()
         File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup
           self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
         File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree
           _rmtree(name, onerror=onerror)
         File "/usr/lib/python3.11/shutil.py", line 738, in rmtree
           onerror(os.rmdir, path, sys.exc_info())
         File "/usr/lib/python3.11/shutil.py", line 736, in rmtree
           os.rmdir(path, dir_fd=dir_fd)
       OSError: [Errno 39] Directory not empty: '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq'
   ```
   
   In this PR, I update the doctest to stop the streaming query before exiting the temporary directory context manager.
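   
   For reference, here is a minimal sketch of the fixed pattern (the `spark`, `df`, and `q` names are illustrative, and the rate source mirrors the doctest's setup): the query is stopped inside the `tempfile.TemporaryDirectory()` block, so cleanup only runs once nothing is writing into the directory.
   
   ```python
   import tempfile
   import time

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # A rate source emits rows continuously, so the query keeps writing
   # checkpoint files until it is explicitly stopped.
   df = spark.readStream.format("rate").load()

   with tempfile.TemporaryDirectory() as d:
       # Start the streaming query, checkpointing into the temporary directory.
       q = df.writeStream.toTable("my_table", checkpointLocation=d)
       time.sleep(3)
       # Stop the query *before* the context manager exits, so TemporaryDirectory
       # is not deleting the directory while the query is still writing into it.
       q.stop()
   ```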
   
   ### Why are the changes needed?
   
   Fixes a flaky test.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No behavior change; this is test-only. There is a small user-facing documentation change, but it is consistent with other doctest examples.
   
   ### How was this patch tested?
   
   Manually ran the updated test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.

