steveloughran opened a new pull request, #39185:
URL: https://github.com/apache/spark/pull/39185
### What changes were proposed in this pull request?
Follow on to SPARK-40034.
Dynamic partitioning through the PathOutputCommitProtocol needs to add the
directories to the superclass's partition list, otherwise the partition
delete doesn't take place.
Fix:
* add an addPartition() method which subclasses can use.
* add a getPartitions() method to return an immutable copy of the list, for testing.
* add tests to verify all of this.
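The two helpers above can be sketched roughly as follows. This is a minimal, simplified illustration, not the actual patch: the real methods live in Spark's commit protocol classes, and the surrounding class, field name, and the demo subclass here are all hypothetical.

```scala
import scala.collection.mutable

// Simplified stand-in for the superclass holding the partition list
// (in the real patch this is the commit protocol base class).
class PartitionTrackingProtocol {
  // partition directories which dynamic overwrite must delete before commit
  private val partitionPaths = mutable.LinkedHashSet[String]()

  // subclasses (e.g. PathOutputCommitProtocol) call this so the
  // superclass knows about partitions they create
  protected def addPartition(dir: String): Unit = {
    partitionPaths += dir
  }

  // immutable copy of the partition set, for assertions in tests
  def getPartitions: Set[String] = partitionPaths.toSet
}

// hypothetical subclass showing how a committer would register partitions
class DemoCommitProtocol extends PartitionTrackingProtocol {
  def newPartitionDir(dir: String): Unit = addPartition(dir)
}
```

A test can then call `getPartitions` and assert on the returned set without ever touching the mutable internal state.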
Also fix newTaskTempFileAbsPath to return a path irrespective of committer
type.
In dynamic mode, because the parent directory of an absolute path is deleted,
there's a safety check to reject any request for a file in such a parent
directory. This check could be pulled up to HadoopMapReduceCommitProtocol,
which needs the same protection, if the risk is considered realistic.
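The safety check can be sketched as a simple prefix test: reject a requested absolute-path file that sits under a directory dynamic overwrite is going to delete. This is an illustrative interpretation of the check described above; the function name and string-based path handling are assumptions, not the actual implementation.

```scala
// Hypothetical sketch: fail fast if an absolute-path file request falls
// under a directory that dynamic partition overwrite will delete.
def rejectPathUnderDeletedDir(absPath: String, deletedDirs: Set[String]): Unit = {
  deletedDirs.foreach { dir =>
    // reject the dir itself and anything beneath it
    require(absPath != dir && !absPath.startsWith(dir + "/"),
      s"Absolute path $absPath is under $dir, which dynamic overwrite will delete")
  }
}
```

A request outside the deleted directories passes; a request underneath one fails with an `IllegalArgumentException` before any data is written.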
The patch now downgrades from failing on dynamic partitioning when the
committer doesn't declare support for it to printing a warning. Why?
Because it *does* work through the s3a committers; it's just O(data). If
someone does want to do INSERT OVERWRITE then they can be allowed to, just
warned about the cost. The outcome will be correct except in one case: if
the driver fails partway through the directory rename, only some of the
files will be there.
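The downgrade itself amounts to replacing a throw with a warning when the committer doesn't declare dynamic-overwrite support. A minimal sketch, with an invented function name and a warning sink standing in for Spark's logger:

```scala
import scala.collection.mutable

// Hypothetical sketch of the fail-to-warn downgrade: instead of rejecting
// the job when the committer does not declare dynamic partition overwrite
// support, emit a warning and continue, since the result is still correct,
// merely O(data) on stores without atomic directory rename.
def validateDynamicOverwrite(
    dynamicPartitionOverwrite: Boolean,
    committerDeclaresSupport: Boolean,
    warnings: mutable.Buffer[String]): Unit = {
  if (dynamicPartitionOverwrite && !committerDeclaresSupport) {
    // previously: throw an exception here
    warnings += ("Committer does not declare dynamic partition overwrite " +
      "support; continuing anyway. Expect it to be slow, and a driver " +
      "failure mid-rename may leave partial output.")
  }
}
```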
Google GCS has the same non-atomic rename issue. But even on a filesystem
with atomic directory rename, any job which fails partway through the
overwrite process will leave the filesystem in an inconsistent state, such as:
* some partitions with new data, some not yet overwritten
* a directory deleted and the new data not instantiated
So it's not that much worse.
The patch also tries to update the protocol specification in
HadoopMapReduceCommitProtocol to cover both the newFileAbsPath()
semantics/commit behaviour and the failure modes of dynamic partition commit.
### Why are the changes needed?
* dynamic partition jobs aren't actually incorrect through the s3a
committers, just slow, so if someone really wants them they should be free
to run them.
* the code in SPARK-40034 was incomplete; the new tests show that it works
now.
* the `newFileAbsPath()` code is required of all committers, even though
(thankfully) it is not used much.
This is an attempt at a minimal commit for complete functionality, with no
attempt to improve performance through parallel directory setup, rename, etc.
That would really be damage limitation: if you want performance in cloud, use
a manifest format.
### Does this PR introduce _any_ user-facing change?
Updates the cloud docs to say that dynamic partition overwrite does work
everywhere; it just may be really slow.
### How was this patch tested?
New unit tests; run through Spark with Hadoop 3.3.5 RC0.
One area not yet covered by tests is: does the absolute-path mapping store
the correct paths and marshal them correctly?
I've just realised that I can test this without adding an accessor for the
map, just by calling `commitTask()` and parsing the `TaskCommitMessage`
which comes back.
Shall I add that?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]