steveloughran commented on PR #6468:
URL: https://github.com/apache/hadoop/pull/6468#issuecomment-1925340023

   I've thought about this some more. Here are some things which I believe we 
need:
   
   1. Marker files at the end of each path so that spark status reporting on 
different processes can get an update on an active job.
   1. A way to abort all uploads of a failed task attempt -even from a 
different process. Probably also a way to abort the entire job.
   1. Confidence that the in-memory store of pending uploads will not grow 
indefinitely.
   
   Ignoring item #3 for now, remember that we have #1 solved by adding a 
0 byte marker with a header of "final length"; Spark has some special handling 
of zero byte files to use getXAttr() and fall back to the probe for this -at 
the expense of a second HEAD request. Generating a modified FileStatus response 
from a single HEAD/getObjectMetadata() call would actually eliminate the need 
for that; I wish I'd thought of it myself. Yes, we do break the guarantee that 
files listed are the same size as the files opened… but magic paths are, well, 
magic. We break a lot of guarantees there already.
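
   A minimal sketch of that length calculation, to make the idea concrete. 
It is not the S3A code: the header name is a hypothetical placeholder and the 
object metadata is modelled as a plain Map standing in for the result of one 
HEAD request.

```java
import java.util.Map;

final class MarkerLength {
    // Hypothetical header name, for illustration only.
    static final String FINAL_LENGTH_HEADER = "x-example-final-length";

    /**
     * Length to report in a file status, given the content length and the
     * user metadata returned by a single HEAD request.
     */
    static long reportedLength(long contentLength, Map<String, String> metadata) {
        if (contentLength != 0) {
            return contentLength;       // ordinary file: report the real size
        }
        String declared = metadata.get(FINAL_LENGTH_HEADER);
        if (declared == null) {
            return 0;                   // genuinely empty file, not a marker
        }
        try {
            // zero byte marker: report the length the upload will have
            return Math.max(Long.parseLong(declared.trim()), 0);
        } catch (NumberFormatException e) {
            return 0;                   // malformed header: keep the real size
        }
    }
}
```

   Only zero byte objects ever have their length rewritten, so ordinary files 
are unaffected.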
   
   The existing design should be retained even in memory; the calculation of 
final length is something which can be done for all.
   
   But: we do not need to save the .pending files just for task abort. All we 
need to do is be able to enumerate the upload IDs of all the files from that 
task attempt and cancel them. We can do that just by adding another header to 
the marker file. Task commit uses the in-memory data; task abort will need a 
deep scan of the task attempt, with all zero byte files carrying the proposed 
new header used to initiate abort operations. This is only for task abort, an 
outlier case. For normal task commit there is no need to scan the directory, 
parse the .pending files, then generate a new pendingset file for later job 
commit. It is probably the JSON marshalling which is as much a performance 
killer here as the listing operation.
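
   A rough sketch of that task abort scan, under stated assumptions: the 
header name is hypothetical, the deep listing is modelled as a list of simple 
entries, and the actual abort call is left to a callback. Mapping a marker 
path to the destination key of its upload is also left to that callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class TaskAbortSketch {
    // Hypothetical header carrying the multipart upload ID of a marker file.
    static final String UPLOAD_ID_HEADER = "x-example-upload-id";

    /** One entry from a deep listing of the task attempt directory. */
    static final class Entry {
        final String path;
        final long length;
        final Map<String, String> metadata;

        Entry(String path, long length, Map<String, String> metadata) {
            this.path = path;
            this.length = length;
            this.metadata = metadata;
        }
    }

    /**
     * Abort every upload referenced by a zero byte marker in the listing.
     * The callback receives (marker path, upload ID).
     * Returns the upload IDs for which an abort was issued.
     */
    static List<String> abortTaskAttempt(List<Entry> listing,
                                         BiConsumer<String, String> abortUpload) {
        List<String> aborted = new ArrayList<>();
        for (Entry entry : listing) {
            if (entry.length != 0) {
                continue;               // not a marker file
            }
            String uploadId = entry.metadata.get(UPLOAD_ID_HEADER);
            if (uploadId == null || uploadId.isEmpty()) {
                continue;               // zero byte file without the header
            }
            abortUpload.accept(entry.path, uploadId);
            aborted.add(uploadId);
        }
        return aborted;
    }
}
```

   Nothing here reads or writes a .pending file; the listing plus the marker 
headers is all the abort path needs.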
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

