steveloughran commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1064041710


   * propose using `spark.sql.sources.writeJobUUID` as the job ID when set; it is more unique than the Hadoop job ID, and Spark should be setting it everywhere.
   * core design looks OK, but I don't see why you couldn't support concurrent jobs just by having different subdirs of `_temporary` for different job IDs/UUIDs (first sketch after this list), plus an option to disable cleanup (and instructions on how to do the cleanup later, which you'd need to write anyway).
   * the file output committer only uses `_temporary/0` because, after a restart, the MR AM lets the committer use `_temporary/1` (the app attempt number is the subdir name) and then move the committed task data from job attempt 0 into its own dir, so recovering all existing work. Spark doesn't need that.
   * it'd be good for you to try out my manifest committer against HDFS with your workloads. It is designed to be a lot faster in job commit because all listing of task output directory trees is done in task commit, and job commit does everything in parallel: listing manifests, loading manifests, creating dest dirs, and renaming files (second sketch after this list). Some of the options you don't need for HDFS (parallel delete of task attempt temp dirs), but I still expect a massive speedup of job commit, though not as much as on stores where listing and rename are slower.
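   
   First sketch: a hypothetical helper (not code from this PR) showing a per-job staging dir keyed on `spark.sql.sources.writeJobUUID`; the fallback to the Hadoop job ID is my assumption about what you'd do when the UUID isn't set.
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.JobContext;
   
   final class JobStaging {
     // Give every job its own staging dir under _temporary so that
     // concurrent jobs writing to the same output path don't collide.
     static Path jobStagingDir(Path outputPath, JobContext context) {
       Configuration conf = context.getConfiguration();
       // prefer the UUID Spark sets for each write job; fall back to
       // the Hadoop job ID when it is absent (assumption, see above)
       String jobUUID = conf.get("spark.sql.sources.writeJobUUID",
           context.getJobID().toString());
       return new Path(outputPath, "_temporary/" + jobUUID);
     }
   }
   ```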
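   Second sketch: the shape of the parallel job commit, with a hypothetical `Manifest` type standing in for the real classes under `org.apache.hadoop.mapreduce.lib.output.committer.manifest`; this illustrates the design, it is not the actual implementation.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   
   final class ParallelJobCommit {
     // hypothetical stand-in for a loaded task manifest
     interface Manifest {
       List<Runnable> renames();   // one rename per committed file
     }
   
     static void commitJob(List<Callable<Manifest>> manifestLoaders,
                           int threads) throws Exception {
       ExecutorService pool = Executors.newFixedThreadPool(threads);
       try {
         // load all task manifests in parallel
         List<Future<Manifest>> loaded = new ArrayList<>();
         for (Callable<Manifest> loader : manifestLoaders) {
           loaded.add(pool.submit(loader));
         }
         // rename every committed file in parallel; the directory-tree
         // listing already happened in task commit, so job commit is
         // pure rename work
         List<Future<?>> renames = new ArrayList<>();
         for (Future<Manifest> f : loaded) {
           for (Runnable rename : f.get().renames()) {
             renames.add(pool.submit(rename));
           }
         }
         for (Future<?> rename : renames) {
           rename.get();           // propagate any rename failure
         }
       } finally {
         pool.shutdown();
       }
     }
   }
   ```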
   
   The reason I don't explicitly target HDFS is that it lets me cut out that testing/QE and focus on abfs and gcs, using benchmarks from those stores to tune the algorithm. For example, it turns out that mkdirs on gcs is slow, so you should check for existence first; that check is now done in task commit, which adds a duplicate probe there. But knowing that abfs does async page prefetch on a `listStatusIterator()` call, I can issue the `getFileStatus(destDir)` call after making the list call and have it complete while the first page of list results is coming in.
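   
   A rough sketch of that ordering (not the actual `TaskAttemptScanDirectoryStage` code, which is linked below):
   
   ```java
   import java.io.FileNotFoundException;
   import java.io.IOException;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.fs.RemoteIterator;
   
   final class ScanWithOverlap {
     static void scanTaskAttemptDir(FileSystem fs, Path taskAttemptDir,
         Path destDir) throws IOException {
       // start the listing first: on abfs the next page is prefetched
       // asynchronously behind this iterator
       RemoteIterator<FileStatus> listing =
           fs.listStatusIterator(taskAttemptDir);
       // the destination probe now overlaps with the first list page
       boolean destExists;
       try {
         fs.getFileStatus(destDir);
         destExists = true;
       } catch (FileNotFoundException e) {
         destExists = false;
       }
       // mkdirs is slow on gcs, so only call it when the probe says
       // the directory is missing
       if (!destExists) {
         fs.mkdirs(destDir);
       }
       while (listing.hasNext()) {
         FileStatus status = listing.next();
         // ... record status in the task manifest ...
       }
     }
   }
   ```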
   
https://github.com/steveloughran/hadoop/blob/mr/MAPREDUCE-7341-manifest-committer/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/TaskAttemptScanDirectoryStage.java#L150
   
   Numbers for HDFS would only distract me, but you will see much faster parallel job commits on "real world" partitioned trees.
   
   

