eugenegujing commented on PR #5643:
URL: https://github.com/apache/texera/pull/5643#issuecomment-4701687366

   @Yicong-Huang Thank you for the detailed review. It was really helpful. I've 
tried to address all three points in the latest commit. Here's what I did, 
please let me know if I've misunderstood anything.
   
   **On combining the steps in a transaction (comments 1 & 3):** I've wrapped 
Path 1 so that, for each session, the row delete and the multipart abort run in 
one `SqlServer.withTransaction` block, with the delete staged first. My 
thinking is that if the abort then fails, the transaction rolls back and the 
row survives, so we shouldn't end up in the partial state you described (record 
gone but the LakeFS entry left behind), so it'd just be retried next round. 
Please correct me if there's a case I'm missing here.
   
   One thing I wasn't sure how to fully solve: since LakeFS is an external 
service, I don't think it can really take part in a Postgres transaction (a DB 
rollback can't undo an abort that already went through), so I couldn't get true 
all-or-nothing across both systems. What I did instead was lean on ordering + 
idempotency. The DB write is gated on the abort, and the abort seems to be 
idempotent (re-aborting an already-aborted upload returns 404, which we treat 
as success), so in my understanding a successful abort followed by a later 
failure should self-heal on the next round. I also kept the transaction 
per-session rather than per-round, mainly so we don't hold a DB connection open 
across many LakeFS calls, but I'm happy to change the granularity if you'd 
prefer. For Path 2 (resetting uncommitted objects) there's no DB write, so I 
didn't see anything to wrap there; it also runs as an independent sweep that 
doesn't rely on the session records, which I think gives us a second chance t
 o clean any orphaned object even if its row is already gone.
   
   **On testing the different step failures (comment 2):** I added tests for 
each case you mentioned, checking the result is either rolled back or cleaned 
next round: no DB record (orphan / null-`repository_name` session), no file 
found (already-aborted 404 and repo-level 404), and a timeout / generic error 
(non-404 abort failure → delete rolled back, row survives, retried). I also 
added one for "cleaned on the next round" (a transient failure self-healing) 
and one for failure isolation (a failing item not blocking a healthy one in the 
same round). The only one I couldn't reproduce deterministically is the 
per-object 404 in Path 2. It's handled the same way in code, but I couldn't 
trigger it in a container test without faking the client, so I left a comment 
pointing to the equivalent repo-level 404 test instead. Let me know if you'd 
like me to cover that differently.
   
   Thanks again for that and I'm happy to adjust any of this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to