steveloughran commented on PR #6468: URL: https://github.com/apache/hadoop/pull/6468#issuecomment-1925340023
I've thought about this some more. Here are some things which I believe we need:

1. Marker files at the end of each path, so that Spark status reporting in different processes can get an update on an active job.
2. A way to abort all uploads of a failed task attempt, even from a different process. Probably also a way to abort the entire job.
3. Confidence that the in-memory store of pending uploads will not grow indefinitely.

Ignoring item 3 for now, remember that we have item 1 solved by adding a zero-byte marker with a header of "final length"; Spark has some special handling of zero-byte files to use getXattr() and fall back to the probe for this, at the expense of a second HEAD request.

Generating a modified FileStatus response from a single HEAD/getObjectMetadata() call would actually eliminate the need for that. I wish I'd thought of it myself. Yes, we do break the guarantee that files listed are the same size as the files opened… but magic paths are, well, magic. We break a lot of guarantees there already.

The existing design should be retained even in memory; the calculation of final length is something which can be done for all.

But: we do not need to save the .pending files just for task abort. All we need is to be able to enumerate the upload IDs of all the files from that task attempt and cancel them. We can do that just by adding another header to the marker file.

Task commit uses the in-memory data; task abort will need a deep scan of the task attempt, with all zero-byte files carrying the proposed new header used to initiate abort operations. This is only for task abort, an outlier case. For normal task commit there is no need to scan the directory, parse the pending files, then generate a new pending set file for later job commit. It is probably the JSON unmarshalling which is as much a performance killer here as the listing operation.

What do you think?
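To make the marker-header idea concrete, here is a minimal sketch of it as an in-memory model, not S3A code. Everything in it is an assumption made for illustration: the class `MagicMarkerSketch`, the header names `x-example-final-length` and `x-example-upload-id`, and the example paths are all hypothetical, and a sorted map stands in for the object store. It shows two things: reporting the declared final length for a zero-byte marker from a single metadata lookup, and a deep scan of a task attempt that collects the upload IDs to abort.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of zero-byte magic markers; not the S3A implementation. */
final class MagicMarkerSketch {

  // Hypothetical header names, chosen for this sketch only.
  static final String FINAL_LENGTH = "x-example-final-length";
  static final String UPLOAD_ID = "x-example-upload-id";

  /** A stored object: its real size plus its metadata headers. */
  static final class StoredObject {
    final long size;
    final Map<String, String> headers;

    StoredObject(long size, Map<String, String> headers) {
      this.size = size;
      this.headers = headers;
    }
  }

  /** Build a zero-byte marker carrying both headers. */
  static StoredObject marker(long finalLength, String uploadId) {
    Map<String, String> headers = new HashMap<>();
    headers.put(FINAL_LENGTH, Long.toString(finalLength));
    headers.put(UPLOAD_ID, uploadId);
    return new StoredObject(0, headers);
  }

  /**
   * Length to report for an object, as if built from one metadata probe:
   * a zero-byte object with the final-length header reports the declared
   * length; anything else reports its real size.
   */
  static long reportedLength(StoredObject obj) {
    String declared = obj.headers.get(FINAL_LENGTH);
    return (obj.size == 0 && declared != null) ? Long.parseLong(declared) : obj.size;
  }

  /**
   * Deep scan of everything under a task attempt path, returning the upload
   * ID of each zero-byte object that carries the upload-ID header.
   */
  static List<String> uploadIdsToAbort(TreeMap<String, StoredObject> store,
      String taskAttemptPath) {
    String prefix = taskAttemptPath.endsWith("/") ? taskAttemptPath : taskAttemptPath + "/";
    List<String> ids = new ArrayList<>();
    for (Map.Entry<String, StoredObject> e : store.tailMap(prefix, true).entrySet()) {
      if (!e.getKey().startsWith(prefix)) {
        break; // keys are sorted, so we have left the task attempt's subtree
      }
      StoredObject obj = e.getValue();
      String id = obj.headers.get(UPLOAD_ID);
      if (obj.size == 0 && id != null) {
        ids.add(id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    TreeMap<String, StoredObject> store = new TreeMap<>();
    store.put("job/__magic/task_01/a.parquet", marker(1024, "upload-a"));
    store.put("job/__magic/task_01/dir/b.parquet", marker(2048, "upload-b"));
    store.put("job/__magic/task_01/empty.txt", new StoredObject(0, new HashMap<>()));
    store.put("job/__magic/task_02/c.parquet", marker(512, "upload-c"));

    // A real abort would cancel each multipart upload; the sketch only lists them.
    System.out.println(uploadIdsToAbort(store, "job/__magic/task_01"));
    System.out.println(reportedLength(store.get("job/__magic/task_01/a.parquet")));
  }
}
```

Running `main` prints `[upload-a, upload-b]` and then `1024`: the plain zero-byte file without the upload-ID header is skipped, and the marker under the other task attempt is never visited. The scan touches every object under the task attempt, which is acceptable only because abort is the outlier path; commit would keep using the in-memory data.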
