NestDream commented on PR #28360:
URL: https://github.com/apache/flink/pull/28360#issuecomment-4686456666

   Thanks for taking a look. Here is how to reproduce it.
   
   `UploadPart` can fail in normal operation (S3 5xx, throttling, a dropped 
connection). When that happens at commit time, the temp file leaks. 
`closeForCommit()` sets `closed = true` and then calls `uploadCurrentPart()` to 
upload the final part. In the current code the temp file is deleted only after 
a successful upload, so when `UploadPart` throws, the delete is skipped. The 
later `close()` is then a no-op because of the `if (!closed)` guard, and the 
`s3-part-<uuid>` file is left behind in `io.tmp.dirs`.
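
   For readers who want the sequence in isolation, here is a minimal standalone 
sketch of the pattern described above. It is not the Flink class: `SketchStream`, 
`PartUploader`, and the `deleteInFinally` switch are hypothetical names made up 
for illustration, and deleting in a `finally` block is only one plausible shape 
for a fix, which may differ from what this PR actually does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class TempFileLeakSketch {

    /** Hypothetical stand-in for the S3 UploadPart call. */
    interface PartUploader {
        void uploadPart(Path part) throws IOException;
    }

    /** Simplified stand-in for the stream described above; not Flink code. */
    static final class SketchStream {
        private final Path tmpFile;
        private final PartUploader uploader;
        private final boolean deleteInFinally;
        private boolean closed;

        SketchStream(Path tmpDir, PartUploader uploader, boolean deleteInFinally)
                throws IOException {
            this.tmpFile = Files.createTempFile(tmpDir, "s3-part-", null);
            this.uploader = uploader;
            this.deleteInFinally = deleteInFinally;
        }

        void closeForCommit() throws IOException {
            closed = true;          // set before the upload is attempted
            uploadCurrentPart();
        }

        private void uploadCurrentPart() throws IOException {
            if (deleteInFinally) {
                // Assumed fix shape: delete whether or not the upload succeeds.
                try {
                    uploader.uploadPart(tmpFile);
                } finally {
                    Files.deleteIfExists(tmpFile);
                }
            } else {
                // Leaky shape: the delete is skipped when uploadPart throws.
                uploader.uploadPart(tmpFile);
                Files.deleteIfExists(tmpFile);
            }
        }

        void close() throws IOException {
            if (!closed) {          // no-op after closeForCommit()
                closed = true;
                Files.deleteIfExists(tmpFile);
            }
        }
    }

    /** Returns true if a temp file is left behind after a failed upload. */
    static boolean leaksOnFailedUpload(boolean deleteInFinally) throws IOException {
        Path tmpDir = Files.createTempDirectory("io-tmp-sketch");
        SketchStream stream = new SketchStream(
                tmpDir,
                part -> { throw new IOException("simulated UploadPart failure"); },
                deleteInFinally);
        try {
            stream.closeForCommit();
        } catch (IOException expected) {
            // The simulated upload failure surfaces here.
        }
        stream.close();
        try (Stream<Path> files = Files.list(tmpDir)) {
            return files.findAny().isPresent();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(leaksOnFailedUpload(false) ? "LEAK" : "NO_LEAK"); // LEAK
        System.out.println(leaksOnFailedUpload(true) ? "LEAK" : "NO_LEAK");  // NO_LEAK
    }
}
```

   The point of the sketch is only the ordering: because `closed` is already 
`true` when the upload throws, `close()` cannot act as a safety net.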
   
   To observe it, the reproduction writes a small object (so the single part is 
uploaded at commit) and substitutes a failing `UploadPart` response for the 
real S3 one. I have attached a small bundle 
([flink-39874-repro.zip](https://github.com/user-attachments/files/28860647/flink-39874-repro.zip))
 that does this four ways, including an end-to-end run on a real Flink job:
   
   ```
   flink-39874-repro/
   ├── README.md      overview and the four methods
   ├── COMMANDS.md    exact commands for each
   ├── src/           runnable reproducers (real Flink job + standalone driver)
   └── logs/          captured output from my runs (trimmed excerpts)
   ```
   
   Each method reproduces the same failing `UploadPart` and then checks 
`io.tmp.dirs`. Run any of them once against an unpatched build and once against 
this PR to see LEAK vs NO_LEAK. Happy to fold any of this into the PR if you 
would prefer it in-tree. 🙂
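
   If you want to check the directory by hand rather than through the bundle, 
something like the following would do. This is a rough sketch and not the 
bundle's own check: the class name `TmpDirCheck` is made up, it only looks at 
files directly inside the given directory, and it assumes `io.tmp.dirs` 
resolves to `java.io.tmpdir` when no path is passed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TmpDirCheck {

    /** Lists files named s3-part-* directly inside tmpDir (non-recursive). */
    static List<Path> leftoverParts(Path tmpDir) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> entries =
                Files.newDirectoryStream(tmpDir, "s3-part-*")) {
            for (Path entry : entries) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: fall back to java.io.tmpdir when no directory is given.
        Path tmpDir = Paths.get(
                args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        List<Path> leftovers = leftoverParts(tmpDir);
        System.out.println(leftovers.isEmpty() ? "NO_LEAK" : "LEAK " + leftovers);
    }
}
```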


