ashokkumar-allu opened a new issue, #18273:
URL: https://github.com/apache/hudi/issues/18273

   ### Bug Description
   
   **GCS/HDFS Delete vs Write Race Condition**
   
   **Problem Statement**
   
   When running Hudi through an abstraction layer over GCS cloud storage on the 
Spark distributed processing framework, a race condition can leave corrupted, 
stale data files in the table base path after a successful commit. Downstream 
jobs that attempt to read these corrupted files fail with 
**ParquetDecodingException**.
   
   This issue is triggered when a job stage hits an intermediate failure (e.g., 
a fetch failure), causing the processing framework to re-launch the stage (task 
retry). The original tasks ("zombie" tasks) sometimes keep running and complete 
their file writes concurrently with the successful, retried stage.
   
   The Hudi reconciliation process, which cleans up temporary and duplicate 
files left by failed/retried tasks, marks the files still being written by 
these slow/zombie tasks as invalid paths to delete. The critical race condition 
occurs because:
   
   1. The delete job starts before the writing tasks have finished creating 
their files.
   2. In the default storage connector configuration, a delete issued against a 
file that is still being written is a no-op: it soft-fails by returning false 
(the final object has not yet been committed to the storage layer), and the 
Hudi client silently ignores the failed deletion attempt and proceeds.
   3. The zombie task completes its write immediately after the failed delete 
attempt, leaving a corrupted, orphaned data file on the storage layer that is 
not tracked by the Hudi commit timeline but is visible to readers.
   
   <img width="591" height="716" alt="Image" src="https://github.com/user-attachments/assets/ed63eadf-31eb-4a2b-9f67-f9425e2dad09" />
   
   
   **Proposed Solution**
   
   The core issue stems from Hudi's reconciliation logic operating on stale 
information and not properly handling delete failures on storage layers where 
the default GCS connector behavior is "Write Wins". This is the case when 
fs.gs.outputstream.type = BASIC, or when fs.gs.outputstream.type = 
FLUSHABLE_COMPOSITE is used **without hflush()** during the write: writes are 
buffered locally and uploaded via resumable/multipart uploads before the final 
object is committed.
   
   OSS PR https://github.com/apache/hudi/pull/13088 addressed part of this, but 
it may not solve the problem completely, since in these modes the files do not 
exist until final object commitment. Here is the proposed solution.
   
   Use fs.gs.outputstream.type = FLUSHABLE_COMPOSITE together with hflush() during the write:
   
   1. Configure the underlying storage connector/file system output stream in a 
mode where the delete operation can either succeed or, crucially, prevent the 
completion of the concurrent write (a "Delete Wins" scenario).
   2. For instance, switching to a configuration (fs.gs.outputstream.type = 
FLUSHABLE_COMPOSITE, with hflush() during the write) that explicitly writes and 
commits the object to the storage layer on each hflush() call allows the 
subsequent delete operation to remove the object. Once the object is deleted, 
any further write calls from the zombie task fail (e.g., with a 404 Not Found 
error), preventing the creation of a corrupted file.
   
   **Steps to reproduce:**
   
   1. Configure a Hudi table on GCS with default connector settings 
(fs.gs.outputstream.type=BASIC or FLUSHABLE_COMPOSITE without hflush())
   2. Run an integration test that checks filesystem behavior when a file is 
deleted while it is still being written:
       -   Create a writer thread that writes a large file to GCS
       -   Create a deleter thread that deletes the same file after a few 
seconds delay (so the delete happens mid-write)
   3. After both threads finish, observe:
       - Whether the file and its directory still exist
       - The file size (if it exists)
       - Any exceptions from the writer or deleter threads
   4. Expected result: Delete should succeed and writer should fail with an 
exception
   5. Actual result: Delete returns false (no-op), writer completes 
successfully, and a corrupted/orphan file remains
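   The writer/deleter interleaving from the steps above can be sketched as a 
deterministic single-process simulation, with a ConcurrentHashMap standing in 
for the GCS object store. This is a model of the semantics described in this 
issue, not the real connector: in BASIC mode nothing is visible until close, so 
the delete no-ops and the late commit leaves an orphan; with FLUSHABLE_COMPOSITE 
plus hflush() the object is already committed, so the delete wins.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   /** Minimal simulation of the two connector modes; all names are illustrative. */
   public class DeleteVsWriteRace {
       static final Map<String, String> store = new ConcurrentHashMap<>();

       /** BASIC mode: bytes stay in a local buffer until close(). */
       static boolean raceWithBasicMode(String key) {
           StringBuilder localBuffer = new StringBuilder("data"); // buffered locally, not visible
           boolean deleted = store.remove(key) != null; // reconciliation delete: object absent -> false
           store.put(key, localBuffer.toString());      // zombie task commits afterwards -> orphan file
           return deleted;                              // false, silently ignored by the caller
       }

       /** FLUSHABLE_COMPOSITE + hflush(): each flush commits the object. */
       static boolean raceWithHflush(String key) {
           store.put(key, "partial");                   // hflush() commits the bytes so far
           boolean deleted = store.remove(key) != null; // delete sees a real object -> true
           // any further write from the zombie task would now fail (404 analog),
           // so no orphan is left behind
           return deleted;
       }

       public static void main(String[] args) {
           boolean basicDeleted = raceWithBasicMode("table/part-0001.parquet");
           System.out.println("BASIC: delete=" + basicDeleted
                   + ", orphan exists=" + store.containsKey("table/part-0001.parquet"));
           store.clear();
           boolean flushDeleted = raceWithHflush("table/part-0002.parquet");
           System.out.println("FLUSHABLE+hflush: delete=" + flushDeleted
                   + ", orphan exists=" + store.containsKey("table/part-0002.parquet"));
       }
   }
   ```

   Expected behavior of the model: the BASIC run reports `delete=false` with an 
orphan left in the store, the hflush run reports `delete=true` with no orphan, 
matching the expected vs. actual results listed above.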
   
   ### Environment
   
   **Hudi version:** 0.14
   **Query engine:** Spark
   **Relevant configs:** fs.gs.outputstream.type, hflush()
   
   
   ### Logs and Stack Trace
   
   _No response_

